Businesses and firms are being overwhelmed with electronic records. Enterprise search offers a promising way to deal with the growth of electronically stored information. However, not all search technologies are well adapted to serve as the search backbone for an enterprise. While keyword searching may help find some documents in document collections, more sophisticated search technologies are called for to assimilate and organize content across the enterprise. We have found that Recommind’s Probabilistic Latent Semantic Analysis (PLSA) search engine is particularly well-suited for law firm and professional services environments, both because of the inherent power of PLSA and because Recommind’s strong and diverse search platform is focused on solving problems inherent in the legal enterprise.
Chasing Paper Away
For decades, businesses have been shifting their work processes, and the records that document those processes, from paper to electronic form. ARMA International, the records management association, estimates that more than 90% of all business records created today are electronic.
Concurrently with this shift from paper to electronic, the volume of records created is increasing as well. Where ten or fifteen years ago many records were themselves formal “documents” managed in a formal and discrete repository (for example, in a document management system), nowadays, informal records are the norm, and they are multiplying in both kind and number. In the case of my own law firm, Bryan Cave LLP, we must account for and manage the correspondence, briefs, pleadings, memoranda, contracts, agreements and other documents for which lawyers are known. Those have shifted to electronic form. But we must now also deal with spreadsheets, databases, image files, emails, attachments, text messages, instant messages, digital voice mails and much more. As a consequence of this proliferation of formal and informal electronic records, we are seeing annual storage volume growth exceeding 50%, and the rate of increase is also increasing.
The Search for Search
In the midst of this digital inundation, another revolution is taking place. Enterprise search technology has finally come of age, and businesses of all sizes are considering how and where to deploy it. Typically, such consideration is in the context of “knowledge management” initiatives. In the case of law firms, the commonplace rationale for acquiring search technology is that it will help find lawyer work product, resulting in more efficient work processes and higher productivity. That efficiency is used to justify the oftentimes significant acquisition costs of search technology.
There is, however, a much more fundamental reason to implement search technology. It promises to become the principal means, and perhaps the only means, by which we can hope to gain control of the rapidly growing body of electronic information that underlies our day-to-day business. And that reality, in turn, dictates the type of search technology that should be acquired. Many enterprise search technologies can help find documents. Some are better than others. But relatively few are positioned to serve as the backbone of an entire enterprise’s information flow, be it formal and isolated or informal and dispersed.
Unfortunately, the recent change to the Federal Rules of Civil Procedure with respect to Electronically Stored Information (ESI), a change which for the most part merely codifies existing practices, has plunged many law firms and companies into a reactive rather than strategic mode of thought. Suddenly, there is a perceived immediate need to have some level of control over electronically stored information, so that firms and companies can engage in discovery conferences and respond to discovery requests as mandated by the Federal Rules.
This panic is leading decision makers toward rapid, knee-jerk choices based on their immediate needs (i.e., any search technology is better than no search technology), much to the detriment of their long-term needs. A better and more strategic approach is to think of search as a fundamental component of the enterprise and to choose search technologies based on long-range needs. Reacting tactically, by solving only today’s problems, will only result in the need to buy new systems over and over again. Firms must evaluate search solutions and their underlying technologies carefully and select those that can meet their requirements for the next five years, or longer. A starting point in such an evaluation is understanding the differences, and the different potential, inherent in various search technologies.
What is Good: Measuring Search Engine Performance
Before exploring the various types of search technologies, it is useful to talk a little about performance measures. What makes a search result good or bad? While there are many possible measures of search performance, ranging from the simple and intuitive to the extraordinarily abstract, two simple measures get at the basics of search technology performance: Precision and Recall. Both are defined in terms of two “sets” of information. Remember set theory from high school math? A set is a collection of distinct objects that, treated together, can be taken as a whole. The sets that are important to assessing search query results are these:
The set of all relevant documents. These are the documents you want to find with your search.
The set of all retrieved documents. These are the documents returned by your search, whether relevant or not.
Precision is the fraction of documents in a query result that are relevant. It is the proportion of retrieved and relevant documents (a set intersection: all retrieved documents that are relevant) to all documents retrieved. A query result with perfect precision would have all returned documents be relevant. Any proportion less than 1:1 indicates a search result with less precision.
Recall is a measure of a search result’s ability to find all relevant documents. It is the fraction of relevant documents retrieved out of all relevant documents. If there are 100 relevant documents and a search retrieves just 20 of them, then the recall of that search is 0.2, which is not a particularly good result.
Note that precision and recall are at odds with each other in most settings. To achieve perfect recall (and very low precision) a search engine merely has to return all documents in the document repository. All relevant documents will be returned, but at the expense of precision. To achieve perfect precision (with no recall) a system merely has to return zero documents for any query. There will be no irrelevant documents returned.
In most search technologies today, there is an inherent inverse relationship between the two measures. Having tighter search parameters will most likely filter out irrelevant results, thus raising precision, but often has the unfortunate side effect of rejecting relevant results, lowering recall. The quest of recent years has been to find a search technology that achieves both high precision and high recall.
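The two measures can be worked through with a small sketch. The document IDs, the collection size and the function name precision_recall are assumptions made up for illustration; only the definitions come from the discussion above. Where a search returns nothing, precision is strictly 0/0, so the sketch reports it as None rather than as a perfect score.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # retrieved documents that are also relevant
    # Each ratio is undefined when its denominator is empty; None marks that.
    precision = len(hits) / len(retrieved) if retrieved else None
    recall = len(hits) / len(relevant) if relevant else None
    return precision, recall


# Hypothetical repository: 1,000 documents, of which 100 are relevant.
collection = set(range(1000))
relevant = set(range(100))

# A narrow query: 25 results, 20 of them relevant.
narrow = set(range(20)) | set(range(500, 505))
print(precision_recall(relevant, narrow))      # (0.8, 0.2)

# Returning the whole repository: perfect recall, poor precision.
print(precision_recall(relevant, collection))  # (0.1, 1.0)
```

The narrow query reproduces the recall of 0.2 from the example above, and returning everything shows the trade-off: recall rises to 1.0 while precision falls to 0.1.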
To many, a search engine is a search engine is a search engine. It finds things. But there exist discrete types of search technologies, each with inherent strengths and weaknesses. When choosing the search technology upon which to found your enterprise architecture, it pays to understand the differences. The three discussed here are set-theoretic models, algebraic models, and probabilistic models.
Those of us who grew up on the earliest search engines have come to think of all search engines as operating according to an intuitive brand of set theory. Our intuition works something like this: The documents we want to find contain certain terms, so we compile those terms into a set called a query. We’ll label that as Set A. Inside our enterprise, we have an entire collection of documents containing many, many terms, including some with those terms in Set A. We’ll call that second grouping of documents Set B. The results we want, then, are those documents that contain the correct search terms, that is, the Intersection of Set A and Set B:
Figure 1-The Intersection of Set A and Set B
That is in fact how most searchers think of searching, and with set-theoretic search engines, they’re not far off. In true set model searching there are actually many more operations than shown above, using multiple sets (e.g., the set of all documents with term X, the set of all documents with term Y, and so on), but the basic idea is correct.
In the legal industry, the earliest search technologies widely accessible to researchers were Lexis-Nexis and Westlaw case law search engines. This is exactly how they operated, and many search technologies today (e.g., those using simple database joins) continue to operate in that fashion.
Plain vanilla set-based searching is certainly better than no searching at all, but some problems become apparent almost immediately. A search through a document collection looking for a sales agreement first needs to find all documents with the word sales and all with the word agreement and then return results that contain both. To do that, set-theoretic engines incorporate Boolean logic and, in this case, the Boolean AND operator (sales AND agreement). The problem is that a set-theoretic search done in this way is just as likely to return a document containing the phrases sales pitch and mutual agreement as it is to return those with sales agreement.
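The sales AND agreement problem can be reproduced in a few lines. This is a minimal sketch, not how any commercial engine is built: the three documents and the helper names build_index and boolean_and are invented for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token.strip(".,;:")].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Return the IDs of documents containing every term (set intersection)."""
    sets = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*sets) if sets else set()


# Hypothetical documents.
docs = {
    "A": "This sales agreement is made between the buyer and the seller.",
    "B": "After the sales pitch, the parties reached a mutual agreement.",
    "C": "The lease agreement covers the office space.",
}
index = build_index(docs)
print(sorted(boolean_and(index, "sales", "agreement")))  # ['A', 'B']
```

Document B is returned alongside the actual sales agreement because the AND operator only checks that both words are present somewhere in the text.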
Almost immediately after set-theoretic search engines were invented, information retrieval scientists began to modify them to alleviate their shortcomings. An early modification was to implement proximity as a value in searching. Thus, a search for sales within 5 words of agreement was more likely to return something that actually was a sales agreement than was the raw intersection of the sales and agreement sets. Such tricks helped, but it was not long before different search technologies emerged that moved searching beyond mere phrase-finding to something more akin to searching for meaning.
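A proximity test of the kind just described might look like the following sketch. The function name within, the tokenizing rule and the two documents are assumptions for illustration; real engines typically store word positions in the index rather than rescanning text.

```python
import re


def within(text, first, second, max_gap=5):
    """True if `first` and `second` occur within `max_gap` words of each other."""
    words = re.findall(r"[a-z']+", text.lower())
    first_pos = [i for i, word in enumerate(words) if word == first]
    second_pos = [i for i, word in enumerate(words) if word == second]
    return any(abs(i - j) <= max_gap for i in first_pos for j in second_pos)


# Hypothetical documents: both contain "sales" and "agreement".
docs = {
    "A": "This sales agreement is made between the buyer and the seller.",
    "B": "After the sales pitch, the parties reached a mutual agreement.",
}
matches = [doc_id for doc_id, text in docs.items()
           if within(text, "sales", "agreement")]
print(matches)  # ['A']
```

In document B the two words are seven positions apart, so the proximity condition filters it out where a plain AND would not.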
Vector Space Search Analysis
Modern, computerized search technology took a great leap forward in the 1960s, when researchers such as Gerard Salton first sought to make information retrieval “smart” (and, in fact, Salton’s first search engine was called S.M.A.R.T.). Most of the search engine products available today, including Verity and Microsoft SharePoint Search, are founded on vector space search models pioneered by Salton.
Vector space search technology is based on linear algebra, which builds on the coordinate geometry first described by Descartes. Linear algebra is the math of matrices (think of them as tables). The most fundamental element of vector space searching is a term space. A term space consists of every unique word that appears in a group of documents. It is often expressed on the Y axis of a table (a list of words running down the left side of the table). The second major element of a vector space search engine is the document repository. This is usually represented on the X axis of the table (a list of documents running across the top). At the intersection of each term and each document is a number, the term count, that reflects how many times that term occurs in that document.
Figure 3-Term Space Table
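A term space table of this kind can be built with a few lines of code. The three short documents below are invented for illustration and are not the contents of the original figure.

```python
from collections import Counter

# Hypothetical documents standing in for a repository.
docs = {
    "doc1": "the president signed the bill",
    "doc2": "the senate passed the bill",
    "doc3": "the church president spoke",
}

# Term counts per document, then the term space: every unique word.
counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
terms = sorted(set().union(*counts.values()))

# Rows are terms (the Y axis), columns are documents (the X axis).
table = {term: [counts[doc_id][term] for doc_id in docs] for term in terms}

print(f"{'term':<10}" + "".join(f"{doc_id:>6}" for doc_id in docs))
for term, row in table.items():
    print(f"{term:<10}" + "".join(f"{n:>6}" for n in row))
```

Each column of the printed table is the list of coordinates for one document, which is what the vector construction described next works from.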
By using the term space as a coordinate space, and the term counts as coordinate values within that space, it is possible to create a vector for each document. In order to understand how this is done, let’s look at a simple example. In the case of a term space containing four unique terms, four axes would arise: the term 1, term 2, term 3 and term 4 axes. (In vector space search theory these axes are usually referred to as dimensions.) By determining how many times each term appears in each document, and plotting the coordinates along each term dimension, the search engine can determine a point in the term space that corresponds to the document. This point in vector space is then used to create a vector for the document back to the origin. Once the vector of a document is plotted through the term space, the magnitude of the vector can be calculated. Think of the magnitude as the length of the line between the document’s point in the term space and the origin of the term space (at coordinates (0,0,0,0) in our example). These vectors and their magnitudes allow vector space search engines to compare documents by calculating the cosine of the angle between them. For example, identical documents will have a cosine of 1, documents containing similar terms will have cosines between 0 and 1, and documents with nothing in common will have a cosine value of zero.
Vector space searching thus delivers one key advantage over set-theoretic search models. It can rank documents according to relevancy. In the much simplified diagram below (vector space models are in fact high dimensional), Document 1 is more relevant to Query 1 than Document 2 is.
Figure 4-Vector Comparisons
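The cosine comparison and the resulting ranking can be sketched as follows. This uses raw term counts only; production engines usually add term weighting, which is not covered here. The documents, the query and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(n * n for n in a.values()))  # magnitude of a
    norm_b = math.sqrt(sum(n * n for n in b.values()))  # magnitude of b
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents and query.
docs = {
    "doc1": "the president signed the bill",
    "doc2": "the senate passed the bill",
    "doc3": "the church president spoke",
}
vectors = {doc_id: vectorize(text) for doc_id, text in docs.items()}
query = vectorize("president bill")

# Rank documents by similarity to the query, most similar first.
ranking = sorted(docs, key=lambda d: cosine(query, vectors[d]), reverse=True)
for doc_id in ranking:
    print(doc_id, round(cosine(query, vectors[doc_id]), 3))
```

Unlike a Boolean match, every document receives a score, so results can be ordered rather than merely included or excluded. Note that doc3 still scores on the word "president" even though it concerns a church.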
Basic vector space searching typically exhibits high recall, but relatively poor precision. This is because a document’s terms are only indifferent predictors of its meaning. To illustrate, let’s assume that we want to explore documents treating chief executives in federal, constitutional systems. The operative term will be “President,” or some root or variant thereof. We will search for that term in a half dozen simple documents, as illustrated in the table below.
Figure 5-Presidential Polysemes
Each of the six columns illustrates a polyseme, a word or phrase with multiple, related meanings. Column 1 associates “President” with the federal government, while column 6 associates “President” with the Mormon Church. Neither set-theoretic nor vector space search technologies can distinguish polysemes based on the semantic meaning of terms. They merely look for occurrences.
There is nothing wrong with search engines based on vector analysis, and for firms and companies with no prior search capability, such technology is a vast improvement. The point of this article, however, is that keyword-finding search technologies are not sufficiently robust to satisfy the demands of foundation-level architecture within the enterprise. What is needed is a search technology that can identify and aggregate results around concepts, not keywords. That means moving beyond set- and vector-based systems.
Latent Semantic Analysis
To remedy shortcomings in vector space searching, researchers at Bellcore (Bell Communications Research) developed an extension of that technology called Latent Semantic Analysis (LSA). It analyzes groups of words and the frequency with which they occur together. LSA represents the words used in a body of text, and any set of these words (such as a sentence, paragraph, or essay), whether taken from the original sources or from new sources (such as a search string), as points in a very high dimensional “semantic space”. The key idea is to map these high-dimensional count vectors to a lower dimensional representation in a so-called latent semantic space. As the name suggests, the goal of LSA is to reveal semantic relationships between the entities of interest (phrases, sentences, paragraphs, documents). LSA produces measures of word-word, word-passage and passage-passage relationships that compare well with human assessments of semantic similarity. The correlations demonstrate close resemblance between what LSA extracts and the way people understand what they have read and heard, and the way such understandings are reflected in the word choices of human writers. LSA is an obvious advancement over vector space analysis, and can discern meanings and distinguish polysemes in ways that simply cannot be accomplished by vector analysis. A good example of an LSA engine is Engenium’s Semetric™ search engine, which provides a SharePoint plug-in that adds LSA-type searching to SharePoint’s inherent vector space search capability. The result is improved precision.
Unfortunately, as typically implemented, LSA exhibits indifferent recall. This has led researchers to look for ways to improve both precision and recall together. The result is probabilistic search technology.
Probabilistic Search Models
We’ll treat two probabilistic search models here, although there are many more variants than two. The reason for choosing these two is that each ties to a product available to and used by the legal sector.
Bayesian Search Technology
At least one vendor, Autonomy, has based its search solution on Bayesian search technology. Bayesian search arises out of the work of Thomas Bayes, a British mathematician, statistician and Presbyterian minister in the 18th century. Searching in this model relies on statistical inference, in which evidence or observations are used to update or to newly infer the probability that a particular proposition is true. In concept searching, if we know the probabilities of words appearing in a certain category of document, then given the set of words in a new document, we can estimate how likely it is that this new document belongs to that category. This type of search technology is often described as Naïve Bayesian searching. It is “naïve” because the underlying algorithms assume that the effect of a variable value on a given class is independent of the values of other variables. This assumption is called class conditional independence. It is made to simplify the computations of probabilities.
Bayesian search technology can yield very high precision in homogeneous and stable document collections. But because it is modeled on a continuously updated set of “observations” about the occurrence and relationship of terms in the document repository, it is less useful in highly dynamic settings, with documents being added or changed frequently. Its statistical models are also better suited to collections with high homogeneity, as might happen, say, in a library centered on a specific topic. It is less well-suited to a collection of diverse documents. Also, notwithstanding its naïve assumptions, as typically implemented inside the enterprise, Bayesian search technology is computationally demanding. Those demands can require a great deal of computational resources or, alternatively, lead to poor performance. While Bayesian technology can be optimized for specific environments, such as chemistry or pharmaceutical research, because of its limitations, it is less well-suited to the legal environment, and indeed relatively few firms have adopted it.
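The naïve assumption can be seen in a toy text classifier. This is a textbook-style sketch with add-one smoothing, not a description of any vendor's product; the training sentences, category names and function names are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per category from (category, text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for category, text in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest posterior log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)  # log prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            # Class conditional independence: each word contributes its own
            # factor, regardless of the other words. Add-one smoothing keeps
            # a zero count from producing a zero probability.
            p = (word_counts[category][word] + 1) / (total_words + len(vocab))
            score += math.log(p)
        scores[category] = score
    return max(scores, key=scores.get)


# Hypothetical training data.
model = train([
    ("contract", "seller agrees to sell goods to buyer"),
    ("contract", "buyer shall pay the purchase price"),
    ("pleading", "plaintiff alleges defendant breached the duty"),
    ("pleading", "defendant denies the allegations of plaintiff"),
])
print(classify("buyer and seller agree on price", model))   # contract
print(classify("plaintiff alleges breach of duty", model))  # pleading
```

Summing one log-probability per word is what makes the computation simple: no joint probabilities over combinations of words are ever estimated.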
Probabilistic Latent Semantic Analysis
A much more powerful technology is Probabilistic Latent Semantic Analysis (PLSA), which goes several steps beyond Latent Semantic Analysis. Its strength is the ability to relate search terms to the aggregation of words that, together, have meaning inside a document. While the math of PLSA may be beyond the scope of this article (and perhaps beyond its author as well), it suffices to observe that PLSA measures the co-occurrence of terms in term space and then uses the probability of that co-occurrence to reduce term space dimensionality below that of LSA. Co-occurrence is illustrated in column 1 of figure 5 above. The terms in that column occur much more frequently together (with others) in documents that discuss federal constitutional governments than in documents that discuss other subjects. Because PLSA abstracts searching to co-occurrence dimensions rather than to strictly term dimensions, it is able to return accurate search results even when certain terms do not occur in the query. For example, a query might omit the term “president” and still return documents that discuss balance of powers in federal constitutional governments.
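For readers who want a feel for the mechanics, the sketch below fits a tiny PLSA-style model with the expectation-maximization (EM) procedure commonly used for it. It is a bare-bones illustration under stated assumptions: the count matrix, the choice of two latent topics and the function name plsa are invented here, and nothing in it reflects how MindServer is implemented.

```python
import math
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit P(w|z) and P(z|d) to a document-term count matrix with EM.

    counts[d][w] is how often term w occurs in document d; every document
    is assumed to contain at least one term.
    Returns (p_w_z, p_z_d, history), where history holds the log-likelihood
    measured at the start of each iteration.
    """
    rng = random.Random(seed)
    n_docs, n_terms = len(counts), len(counts[0])

    def normalize(row):
        total = sum(row)
        return [x / total for x in row]

    # Random, strictly positive starting distributions.
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_terms)])
             for _ in range(n_topics)]
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    history = []

    for _ in range(n_iter):
        new_w_z = [[0.0] * n_terms for _ in range(n_topics)]
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        log_likelihood = 0.0
        for d in range(n_docs):
            for w in range(n_terms):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                p_w_d = sum(joint)
                log_likelihood += n * math.log(p_w_d)
                for z in range(n_topics):
                    expected = n * joint[z] / p_w_d
                    new_w_z[z][w] += expected
                    new_z_d[d][z] += expected
        # M-step: renormalize the expected counts into probabilities.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
        history.append(log_likelihood)
    return p_w_z, p_z_d, history


# Hypothetical counts: two government-flavored and two church-flavored documents.
terms = ["president", "congress", "veto", "church", "prophet", "temple"]
counts = [
    [2, 1, 1, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 0, 0, 2, 1, 1],
    [1, 0, 0, 1, 2, 1],
]
p_w_z, p_z_d, history = plsa(counts, n_topics=2)
for d, mix in enumerate(p_z_d):
    print(f"doc {d}: " + ", ".join(f"{p:.2f}" for p in mix))
```

Each document ends up described by a mixture over latent topics rather than by its raw terms. On data this cleanly separated one would typically expect the first two documents to lean toward one topic and the last two toward the other, although EM can settle in different local optima depending on the starting point.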
One of the strengths of PLSA is its ability to minimize “perplexity,” the tendency of a search technology to be “surprised” when it encounters documents outside of its initial or “training” collection (the document collection used to establish initial term/document and term/term relationships). Bayesian searching and LSA both are prone to high levels of perplexity (as are we all). PLSA, in contrast, has shown a remarkable, and beyond human, capability to predict meaning in previously “undiscovered” documents. This makes it especially well-suited for dynamic collections of the type typical in law firms.
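Perplexity has a standard numerical definition in language modeling: the exponential of the average negative log-probability a model assigns to each word of unseen text. The sketch below applies that definition to two toy word-probability models; the models and the held-out words are invented for illustration, and a real evaluation would use a trained PLSA or LSA model instead.

```python
import math


def perplexity(word_probs, held_out_words):
    """exp of the average negative log-probability per held-out word."""
    log_sum = sum(math.log(word_probs[word]) for word in held_out_words)
    return math.exp(-log_sum / len(held_out_words))


# Hypothetical held-out text the models have not seen before.
held_out = ["president", "congress", "president", "congress"]

# A model with no idea what to expect: all four words equally likely.
uniform = {"president": 0.25, "congress": 0.25, "church": 0.25, "temple": 0.25}

# A model that has learned this collection is mostly about government.
sharper = {"president": 0.4, "congress": 0.4, "church": 0.1, "temple": 0.1}

print(perplexity(uniform, held_out))  # close to 4.0
print(perplexity(sharper, held_out))  # close to 2.5
```

Lower is better: the second model is less "surprised" by the new text because it assigns higher probability to the words that actually appear.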
Academic studies have shown that PLSA is much more likely than vector analysis, and significantly more likely than LSA, to find terms based on the concepts to which they are related. It is therefore better suited to serve at the foundation of enterprises where concept-finding and concept-aggregation are far more important than term-finding.
Platform Flexibility, Diversity and Fit
We have seen that PLSA-based search technologies such as that supplied by Recommind’s MindServer provide an exceptionally strong technical foundation for enterprise search. That, however, is not yet sufficient to make a selection of a particular technology to serve at the foundation of an enterprise. Three additional elements are required: platform flexibility, platform diversity and platform fit.
Before addressing each of those criteria, however, it is worth discussing what the goals of an enterprise search technology might encompass. As we have noted before, some who enter the search engine market have simple requirements: the desire to be able to initiate discrete searches for particular documents embodying work product. That is certainly one use of search technology, but a case can be made that it is, of all the uses of search technology, the least productive use of it.
Consider again the rise in volume and categories of electronically stored information facing modern business organizations. Our human instincts tell us to organize and categorize that information much as we have the paper that defined our work processes in the past. We strive to create folders and files much as we did with paper and, I daresay, with parchment and papyrus as well. But these manual “taxonomies” are impossible to maintain in the face of the information growth curves we now confront. Some more automated process is called for, one that is at least as reliable as human engendered taxonomies, and preferably far more so. What is wanted in the modern organization is a kind of information “gravity,” a fundamental force that acts to move information to exactly the place it needs to be: to permit a decision to be made, to permit a crucial fact to be discovered, to permit a conclusion to be drawn from an appropriate aggregation of information.
All of this is beyond the search “engine” as we have come to know it. Conducting discrete searches by pulling up a search box and typing a query may never disappear, but it is far too manual a process to provide any hope of managing the rising tide of electronic information. Rather, we should look for search technologies that are super-human in their capabilities and woven into platforms that can be used in every corner of the organization, with utter reliability and purpose. It is to these ends that we look for flexibility, diversity and fit.
Document searching is a key use of search technology inside a law firm. We lawyers create documents with abandon. But, in any given day, we do far more than search for documents. We confront, categorize and solve problems. We create teams, then manage them. We communicate with team members, with clients, with co-counsel and opposing counsel. Each of these activities spins off information and requires information to pursue effectively. Any search technology a law firm embraces needs to accommodate all such activities. It may be that instead of finding a document, what is needed is to find an expert who can create such a document. It may be that to communicate with a client, one needs to effectively aggregate and review a wide variety of information-financial, matter-related, historical, personnel-related, etc. At the pace of modern business, there is no time to manually compile such an array of information. A good search technology, on the other hand, should be able to do so, and without need for a series of discrete queries. It should be flexible enough to assimilate documents, time entries, database entries, matter descriptions, billing records and much more into a uniform view attuned to whatever problem is sought to be solved. We chose Recommind’s MindServer precisely because it exhibited such flexibility. We have already applied it to create a kind of gravity inside work portals, and each day brings new potential for similar expanded applications. Many search technologies would fail even this first test. They are suited to one task and one task only.
Not all organizations will have the technical capability to adopt foundation level technologies to solve new problems. A good enterprise search technology should already have solved a number of problems likely to be faced by the enterprise. In Recommind’s case, they already have applications that use PLSA to categorize documents, identify experts, identify documents, associate emails and other electronic information with clients and matters, categorize matters and conduct e-Discovery, all of which are activities likely to be undertaken inside a law firm. That out-of-the-box diversity eliminates problems associated with attempting to integrate separate information management solutions (and there are many on the market in each of the areas just mentioned). MindServer has already been adapted to those uses, and my firm need not learn anew the requirements and constraints of new technologies in each of those areas.
Law firms are specialized creatures, and their requirements are often unique and non-intuitive to outsiders. The same may be said of other disciplines such as medicine, pharmacology, chemistry and architecture. Both the search technology itself and the company that serves up such technology must fit the organizations they serve. Recommind has specialized in the legal vertical (and some others) for years, and as a consequence, their current products, their support services and their product pipeline are extremely well adapted to the needs of law firms and legal departments. There are other strong technologies in the marketplace, but the cost of adapting a generic technology to the needs of an entity inside a new and alien industry can dwarf the acquisition and installation costs of such a technology. The advantage of fit is that every new buyer can trade on the experience of all the buyers before. In the legal sector, we determined that Recommind offered far and away the best fit of all possible search technologies.
For all of the reasons discussed above, we have found that Recommind’s MindServer Probabilistic Latent Semantic Analysis search engine is particularly well-suited for law firm and professional services environments, both because of the inherent power of PLSA and because Recommind’s strong, flexible and diverse search platform is focused on solving problems inherent in the legal enterprise.
 The core Naïve Bayesian logic may not itself be computationally demanding, but as typically deployed, with what can include hundreds of supplemental and compensatory algorithms, such search technologies can be quite demanding and relatively difficult to tune.