E-Discovery Update: Recognizing Hidden Logistical Bottlenecks in E-Discovery
By Conrad J. Jacoby, Published on April 24, 2007
Litigants requesting documents and those producing documents in discovery often clash about the speed with which materials are produced. Typically, receiving parties cannot understand why materials aren’t made available sooner, while producing parties often complain that they are not given enough time to meet their legal obligations without significant overtime and stress. How much time do things really take?
Leaving aside the work of processing electronic documents (which means different things to different vendors and review platforms) and subjectively reviewing documents for relevance (which means different things to different law firms and in different types of legal disputes), one substantial but hidden task in the e-discovery life cycle is simply moving electronically stored files and information from one location to another. In everyday use, it takes only a few seconds to copy a file from one folder to another or from a hard drive to a USB flash drive. In e-discovery, it can take days or weeks. Why?
For starters, it’s easy to lose perspective on exactly how much data is typically involved in e-discovery projects. While copying a few files may only take a minute or two, typical e-discovery projects involve hundreds of thousands of files, if not millions. The hard drive for a single Microsoft Windows-based personal computer contains over two hundred thousand system and data files—and that’s before users begin storing their own data. Automatically-created files, such as Internet browser cached data and software-controlled backup sessions, contain unique information that might be relevant to a dispute, even though they were not created in a direct sense by the computer operator. For completeness of the preservation effort, these may need to be captured just as much as the material in a “My Documents” folder. Indeed, because of the diverse places that users can store potentially relevant data on a computer, it’s a fairly standard practice to capture all data on a hard drive to prevent preservation/spoliation issues from arising if the case changes focus at a later point. Harvesting network-based data is still often done on a folder-level or network share basis, but some of these “subsets” can be enormous. One-gigabyte and larger network shares are increasingly common.
Many data harvesting professionals or services use a forensic tool like EnCase to make a bit-level copy of target hard drives and network directories. When using these applications, a popular rule of thumb is that every forty (40) or so gigabytes of a computer hard drive takes about an hour to copy. Forty gigabytes per hour may not seem particularly fast, but this is actually a speedier data transfer rate than copying data on a file-by-file basis, and it also preserves all file-related metadata. However, as hard drives get bigger (most laptop computers currently ship with 80+ gigabyte drives, and many desktop computers ship with 300+ gigabyte drives), it can take as much as eight hours to copy a single hard drive, depending on its size and on the processing and hard drive speed of the computer creating the bit-level image. Large network shared volumes can take days, even weeks, to copy because the copying process often shares resources with other data read/write requests so that a server can remain in service while its relevant data is duplicated.
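The rule of thumb above lends itself to quick back-of-the-envelope arithmetic. The following is a minimal sketch in Python; the function name and the example drive sizes are chosen only for illustration, and the 40-gigabyte-per-hour figure is the approximate rate cited above, not a measured value for any particular tool or machine.

```python
def estimate_imaging_hours(drive_gb: float, rate_gb_per_hour: float = 40.0) -> float:
    """Estimate hours needed to image a drive at a steady copy rate.

    The default rate is the article's rule of thumb (about 40 GB per hour);
    real rates vary with hardware and with competing load on the source.
    """
    if rate_gb_per_hour <= 0:
        raise ValueError("rate_gb_per_hour must be positive")
    return drive_gb / rate_gb_per_hour


# Illustrative drive sizes taken from the figures mentioned in the text.
for label, size_gb in [("80 GB laptop drive", 80), ("300 GB desktop drive", 300)]:
    print(f"{label}: about {estimate_imaging_hours(size_gb):.1f} hours")
```

At this rate a 300-gigabyte desktop drive works out to roughly seven and a half hours, which is in line with the eight-hour figure above. The estimate covers the copy alone and does not include verification or any later working copies.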
As if the time it takes to make a bit-level copy isn’t enough, a further good practice is to verify the copy to ensure that it is error-free and identical to the original. The verification process requires software to read the data on two separate hard drives and compare it on a bit-by-bit basis to make sure that the data is identical. Because both drives must be read from start to finish, verifying a drive image can take longer than making the copy in the first place.
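To make the cost of verification concrete, here is a minimal sketch of a chunk-by-chunk comparison in Python. It compares two ordinary files rather than raw drives, and the function name and chunk size are assumptions made for illustration; forensic tools such as EnCase have their own verification mechanisms, which this does not attempt to reproduce.

```python
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time so large images never sit fully in memory


def files_identical(original: Path, copy: Path, chunk_size: int = CHUNK_SIZE) -> bool:
    """Return True only if both files contain exactly the same bytes."""
    # Different sizes can never match, so skip the expensive read in that case.
    if original.stat().st_size != copy.stat().st_size:
        return False
    with original.open("rb") as src, copy.open("rb") as dst:
        while True:
            block_a = src.read(chunk_size)
            block_b = dst.read(chunk_size)
            if block_a != block_b:
                return False  # first mismatch ends the comparison
            if not block_a:
                return True  # both files reached the end with no mismatch
```

When the copy is good, every byte of both files has to be read before the function can return True, which is why a successful verification pass costs at least as much reading as the original copy.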
Once the data has been harvested from its original source, it’s a good idea for litigants to make a backup or working copy of the “original” data that has been harvested. This redundancy guards against the possibility of data loss if a mechanical hard drive fails, taking with it data that is no longer available on the original data storage location. Again, because it is faster to copy a small number of large files than a larger number of smaller files, vendors often duplicate bit-level (e.g., EnCase) images so that they can “roll back” to the point of data acquisition if any questions arise. Because bit-level images contain all individual file metadata within their structure, these images can be copied using ordinary file utilities without affecting the source data they contain in any way. This reduces the total amount of time needed to copy this data, though the total amount of data being backed up can still turn this into a multi-day (sometimes even multi-week) process.
Things don’t necessarily speed up once a working copy is ready for further processing. Capturing files and storing them in a compressed format may be convenient and relatively fast, but they cannot be analyzed or readied for review until they are returned to their uncompressed formats. A single large EnCase image may take a day to extract, even on a high-powered computer workstation. Even then, further work is often required. Many users store large files or groups of files in .ZIP or other compressed format. These files must be manually identified and decompressed. The ZIP format supports compressed files within compressed files, so it is also possible to find .ZIP files within a .ZIP file. All of these files must be opened and extracted before files can be processed. Many times, some degree of human intervention is required to identify, queue up, and monitor this particular decompression process, since it is applied to only selected files within a data collection.
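The nested .ZIP problem described above can be sketched with Python's standard zipfile module. This is an illustration under stated assumptions, not a description of any vendor's workflow: the function name, the "_extracted" folder naming, and the depth limit are all hypothetical, and password-protected or corrupt archives (which would raise an error here) are exactly the cases that tend to need human attention.

```python
import zipfile
from pathlib import Path
from typing import List


def extract_nested_zips(archive: Path, dest: Path, max_depth: int = 10) -> List[Path]:
    """Extract an archive, then any .zip files found inside it, recursively.

    Returns the paths of all extracted files that are not themselves archives.
    """
    if max_depth < 0:
        raise RuntimeError(f"ZIP nesting too deep at {archive}")
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)

    extracted: List[Path] = []
    # The listing is materialized before recursing, so folders created by
    # inner extractions are not walked a second time by this loop.
    for path in sorted(dest.rglob("*")):
        if not path.is_file():
            continue
        # Match on the .zip extension as well as the file signature: Office
        # files such as .docx are ZIP containers and should stay intact.
        if path.suffix.lower() == ".zip" and zipfile.is_zipfile(path):
            inner_dest = path.with_name(path.name + "_extracted")
            extracted.extend(extract_nested_zips(path, inner_dest, max_depth - 1))
        else:
            extracted.append(path)
    return extracted
```

Each level of nesting adds another full read-and-write pass over the data it contains, which is part of why decompression adds time even when it is automated.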
Once all the target data is available in uncompressed form, it may be time to make yet another backup copy of the data collection. Few if any ESI processing vendors will work with original data because of potential liability for inadvertently altering information. Instead, careful vendors generally require a working copy of the harvested data or else permission for them to immediately make their own bit-level copy of the materials. Bit-level copying is generally not required at this point, so less data per hard drive is being copied—speeding up the process. However, it can take significantly longer to copy many small files than a few large EnCase images. Balancing these two factors out, creating the working copies before starting actual processing can still take as many machine-hours as initially harvesting the data. For large ESI collections taken from many computers, creating a single working copy of all the data can take several weeks, even when the task is distributed over multiple workstations or server blades.
At the end of these efforts, harvested data is finally ready for loading into a document review platform or processing system. “Loading” is the key word, as the data must be copied one last time to make it available for processing. Some processing systems read data directly from the data provided by clients, potentially reducing the need to create a backup copy before starting processing, but this “loading” task essentially copies files on an individual basis into memory or to a new location as part of the overall processing workflow.
Any of these tasks, of course, can be substantially accelerated by distributing large projects across multiple computers or CPUs, all of which then process portions of the project in parallel. An array of coordinated CPUs can reduce the time required from months to weeks, or from weeks to days. However, the exact time savings depends on the number of machines dedicated to these tasks, which is directly influenced by client budgets and the amount of hardware available for use. In addition, distributed multi-tasking is not infinitely scalable; it is still limited by hardware constraints. Many blades can decompress data simultaneously, but writing the data to a common file server or common database, so that all the disparate pieces can be knitted back into a single document collection, will limit the maximum possible data throughput.
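The scaling limit described above can be expressed as a simple model: total throughput is the combined rate of the workers, capped by the rate at which the shared file server or database can accept writes. The sketch below is a deliberate simplification, and the function name and every number in the example are hypothetical values chosen for illustration rather than figures from any real project.

```python
def processing_hours(
    total_gb: float,
    workers: int,
    gb_per_hour_per_worker: float,
    shared_write_gb_per_hour: float,
) -> float:
    """Estimate elapsed hours when parallel workers feed one shared destination."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Workers scale linearly until the shared destination becomes the bottleneck.
    effective_rate = min(workers * gb_per_hour_per_worker, shared_write_gb_per_hour)
    return total_gb / effective_rate


# Hypothetical project: 1,000 GB, 40 GB/hour per worker, shared store caps at 200 GB/hour.
for n in (1, 4, 10, 20):
    print(f"{n:>2} workers: {processing_hours(1000, n, 40, 200):.2f} hours")
```

Under these assumed numbers, going from one worker to four cuts the estimate from 25 hours to 6.25, but going from ten workers to twenty changes nothing, because the shared write rate has already become the limit.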
Because it is certainly one of the less exotic aspects of electronic discovery practice, it’s easy to overlook the importance of transporting and backing up electronically stored discovery materials. However, understanding the impact of these tasks on the e-discovery timeline can help attorneys ensure that they do not misunderstand, or misrepresent, the speed with which a client can respond to questions or requests from either opposing counsel or an interested judge.