Notes from the Technology Trenches: Filtering MS Word Documents into HTML

Roger Skalbeck is the Technology Services Librarian and Webmaster at George Mason University School of Law in Virginia, and he is a web committee member for the Law Librarians’ Society of Washington, D.C. Opinions expressed in this column do not necessarily reflect those of his employer or any other organization. This column, of course, is 100% free of any legal advice.

A new service for obtaining patents on the Internet is just coming out of its beta testing phase, and if you have any regular need to research or obtain patents, it is worth a close look. This service provides fast and efficient access to United States patents, which can be obtained as individual documents or in groups. One big attraction is that each multi-page patent is downloaded as a clean, crisp file, with all pages in a single navigable document. You can either retrieve patents directly on the site or request to have batches of them sent to you via email.

Search options include many of the standard searches that you would expect, including patent number, inventor, assignee, full-text searching and related field options. In terms of extensive features and sheer speed, power patent searchers will still likely prefer patent files on a service like Dialog or Lexis-Nexis for complex fielded or multi-level Boolean searches. Nonetheless, this service is still useful for researching patents, and there are some unique search options in its browser-based interface. Also, in terms of obtaining imaged patent documents, it is really quite impressive.

As an example of extended searching, when you view the text of the front-page information of a patent, most of the fields are set up as hypertext links that can be submitted as subsequent searches. This works like many modern web-based catalogs: by clicking on a subject heading or an entity like an author, you perform a search to retrieve all matching documents. Here, this works especially well for searching by patent classification, examiner and to a great extent by assignee. However, it is not as useful as an automated tool for searching the legal representation, as it requires a character-by-character match, which can be difficult with changing firm names and the omission of partnership designations or even basic punctuation.
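
To illustrate why a character-by-character match on legal representation is fragile, here is a minimal Python sketch. The firm names are fictional, and the `normalize_firm` helper and its list of partnership designations are assumptions made for this example; nothing here describes how the patent service itself matches names.

```python
import re

# Hypothetical list of partnership designations to ignore when comparing names.
DESIGNATIONS = {"llp", "llc", "pc", "pllc", "lp"}


def normalize_firm(name: str) -> str:
    """Reduce a firm name to a loose comparison key (illustrative only)."""
    text = name.lower().replace("&", " and ")   # treat "&" and "and" alike
    text = re.sub(r"[^a-z0-9\s]", "", text)     # drop commas, periods, etc.
    words = [w for w in text.split() if w not in DESIGNATIONS]
    return " ".join(words)


# Two fictional spellings of the same firm, as they might appear on patents.
on_older_patent = "Smith, Jones & Doe, L.L.P."
on_newer_patent = "Smith Jones and Doe"

# An exact comparison treats these as different firms.
print(on_older_patent == on_newer_patent)

# A normalized comparison treats them as the same firm.
print(normalize_firm(on_older_patent) == normalize_firm(on_newer_patent))
```

A hyperlink that submits the exact string from one patent behaves like the first comparison, which is why it can miss patents where the firm is written slightly differently.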

Coverage & Competition

Searchable patents available at this site date back to 1976, and utility patents, by far the biggest group of patent documents, can be obtained by patent number back through 1920. The archive is updated on a weekly basis with new patents. At present, only United States patent documents are available, and the web site does not mention any plans for expanding coverage to include documents of international patenting bodies. Considering that two of the biggest potential competitors include non-US documents as part of their Web-based services, I’ll be surprised if this service remains a source for solely “domestic” patent documents. These competitors, Micropatent and the Delphion Intellectual Property Network (aka the IBM Patent Server), have slightly different pricing and orientation models, so it is likely that they can all easily co-exist as providers of patent documents and information.

Very broadly stated, Micropatent has a fee-based patent search system and a transaction-based system for downloading patent images. The Delphion/IBM model provides free access to many features, which are balanced by online advertising and the cumbersome one-page-at-a-time interface for getting patent images. At that site, patents can subsequently be purchased to download in multiple formats or ordered in print through a vendor partner. The new service offers subscribers access to searching and downloading documents on a per-month subscription basis, priced by a set number of bundled patents. Following is a quick snapshot of pricing for the Cartesian Products service, as of press time for this article.


Monthly Fee    Free patents/month    Price per additional patent

It will be interesting to see if the company decides to adopt a strictly usage-based pricing option for users who want individual patents without the subscription. For information on pricing and options of the other two services, refer to their respective web sites.

Technology & Requirements

At the heart of this new service is Cartesian Perceptual Compression (CPC), a file compression technique developed by Cartesian Products, Inc. to provide high-quality images that take up a fraction of the storage space of other image formats. In brief, this means that CPC-compressed documents take up less space than those in resident Adobe Acrobat (PDF) or TIFF format, and there appears to be no noticeable difference in quality. This is very attractive, as it also means quicker download times, and the documents received via email take up less server space.

The Cartesian Products web site provides a detailed Technical Overview of Cartesian Perceptual Compression, with information on how it compares to other document imaging and compression methods. CPC has been around for a few years now. It was used at least as far back as 1997, when it was chosen as the storage method for JSTOR, a not-for-profit organization affiliated with universities and dedicated to archiving long runs of journal titles in image format. JSTOR uses CPC for compression but does not provide documents in a resident CPC format, as a JSTOR representative clarified for me. (Note: Some details of JSTOR’s rationale for choosing CPC can be found in the report The impact of electronic journals on local network computing and printing environments.)

From a purely practical standpoint, the main thing to note is that the CPC format requires the user to download and install a specialized CPC viewer as a browser plug-in in order to see and print documents. It works similarly to some of the commercial multi-page TIFF viewers that I have seen, allowing for page rotation and navigation as well as multi-page viewing and (in the enhanced version) user annotations. It comes in a free CPC Lite version as well as the more enhanced CPC View, which currently sells for $19.95. The Lite version downloaded quickly and was fairly easy to install.

HTML Filtering and Conversion from MS Word

If you have ever had to take Microsoft documents and convert them for use on a web site, it’s pretty likely that you have had some frustration in doing so. It seems that the more Microsoft advances its products, and the more feature-rich they become, the more difficult it is to bring things down to a very basic level for use on a web page. In recent months, I have had to deal more and more with the task of converting documents from Microsoft Word 2000 into what I had hoped would be clean and compact HTML. While I have not yet found the perfect solution for every occasion, I have learned a few tricks that are worth sharing.

In a Windows environment, if you are copying text or sections from one application to another (as opposed to converting whole documents), consider using the “Paste Special” option that many applications now employ. In this instance, you copy something from a web page, word-processed document or other entity, and then the options of the target application (such as your HTML editor, email program or word processor) determine how flexible your paste operation can be. As an example, MS-Word 2000 offers the options of pasting text in at least the following formats: formatted text (RTF), unformatted text, HTML format and unformatted Unicode Text. This option is generally found under the “Edit” menu as “Paste Special”. If you don’t know the difference between these options, try each of them in turn to see the differences in how they perform.

If you are interested in converting documents or whole sections of documents into HTML, you have a variety of options, some of which are built into the applications. Of course, both MS-Word 97 and MS-Word 2000 allow you to save documents in an HTML format, but I have found that these tend to add a lot of proprietary code, and can do some unexpected things in rendering your formatted text. Another option is to employ the Microsoft Office 2000 HTML Filter 2.0. According to the main product page, this is a tool “you can use to remove Office-specific markup tags embedded in Office 2000 documents saved as Hypertext Markup Language (HTML).” In using this, I found that it helped somewhat to strip out “Office-specific markup tags”, but it was not a complete cleaning process. Especially with respect to tables, I find that the latest version of Microsoft Word adds a lot of extra coding and formatting for HTML conversion. For information from Microsoft on how to best use this filter, see their document: “How to use the HTML Filter in Word 2000.”
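
For readers who want to see what “Office-specific markup” tends to look like, here is a rough Python sketch that strips a few common kinds of it from a Word-generated HTML fragment. This is not how Microsoft’s HTML Filter works internally; the patterns and the `strip_office_markup` function are my own illustrative assumptions, the list is far from exhaustive, and regular expressions are a blunt instrument for HTML in general.

```python
import re

# Illustrative patterns for markup Word 2000 tends to write when saving as HTML.
CONDITIONAL_COMMENT = re.compile(
    r"<!--\[if[^\]]*\]>.*?<!\[endif\]-->", re.DOTALL | re.IGNORECASE
)
NAMESPACED_TAG = re.compile(r"</?[a-zA-Z]+:[^>]*>")   # e.g. <o:p>, </o:p>
MSO_CLASS = re.compile(
    r"\s+class=(\"Mso[^\"]*\"|'Mso[^']*'|Mso\w+)", re.IGNORECASE
)
STYLE_ATTR = re.compile(r"\s+style=(\"[^\"]*\"|'[^']*')", re.IGNORECASE)
XMLNS_ATTR = re.compile(r"\s+xmlns(:\w+)?=\"[^\"]*\"", re.IGNORECASE)


def _clean_style(match):
    """Keep a style attribute, minus any 'mso-' declarations inside it."""
    quoted = match.group(1)
    quote = quoted[0]
    declarations = [d.strip() for d in quoted[1:-1].split(";")]
    kept = [d for d in declarations if d and not d.lower().startswith("mso-")]
    if not kept:
        return ""  # nothing left, so drop the whole attribute
    return f" style={quote}{'; '.join(kept)}{quote}"


def strip_office_markup(html: str) -> str:
    """Remove a handful of Office-specific constructs (a sketch, not a full filter)."""
    html = CONDITIONAL_COMMENT.sub("", html)   # <!--[if gte mso 9]> ... <![endif]-->
    html = NAMESPACED_TAG.sub("", html)        # <o:p>, <v:shape>, and similar
    html = MSO_CLASS.sub("", html)             # class=MsoNormal and similar
    html = STYLE_ATTR.sub(_clean_style, html)  # mso-* declarations inside style
    html = XMLNS_ATTR.sub("", html)            # xmlns:o="..." on the html tag
    return html


sample = (
    '<html xmlns:o="urn:schemas-microsoft-com:office:office">'
    "<!--[if gte mso 9]><xml><o:DocumentProperties></o:DocumentProperties></xml><![endif]-->"
    "<p class=MsoNormal style='mso-pagination:none;text-align:center'>Hello<o:p></o:p></p>"
    "</html>"
)

print(strip_office_markup(sample))
# prints: <html><p style='text-align:center'>Hello</p></html>
```

Note that ordinary formatting such as `text-align:center` survives, which mirrors my experience that filtering out Office-specific tags alone does not yield fully clean HTML.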

As a somewhat low-tech alternative to the Office 2000 HTML Filter 2.0 product, you might want to check out the ANT_HTML program, especially if you are using an earlier version of Microsoft Word. This is a program from Tela Communications that converts Microsoft Word documents to HTML and will also work to convert HTML to WYSIWYG format (though I didn’t test out the latter option).

From reading the documentation about this product, it looks as though it was written more for earlier versions of Microsoft Word, but it does work with Word 2000 rather well. In testing it out, I took a document with a few tables in it and converted the whole thing with ANT_HTML. Unlike Microsoft’s own HTML filter, this process left no residual proprietary tags (such as sizing, style or font attributes) or other formatting that might conflict with my own documents. In brief, this program provides additional toolbars within Microsoft Word that perform macro transformations of documents to strip out application-specific codes, rendering baseline HTML documents. These toolbars can also be used for building HTML forms, performing user-defined macro transformations and a host of related “clean-up” tasks.
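
The idea of “baseline HTML” can be sketched with a whitelist: keep only a short list of structural tags and drop everything else, including sizing, style and font attributes. The Python below uses the standard library’s `html.parser` and is only an illustration of that idea. It is not ANT_HTML’s method (which runs as Word macros), and the `ALLOWED` whitelist and the `to_baseline_html` function are assumptions chosen for this example.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: tags to keep, each mapped to the attributes allowed on it.
ALLOWED = {
    "p": set(), "br": set(), "b": set(), "i": set(),
    "ul": set(), "ol": set(), "li": set(),
    "h1": set(), "h2": set(), "h3": set(),
    "a": {"href"},
    "table": {"border"}, "tr": set(),
    "td": {"colspan", "rowspan"}, "th": {"colspan", "rowspan"},
}
VOID_TAGS = {"br"}                  # tags written without a closing tag
SKIP_CONTENT = {"style", "script"}  # tags whose text content is discarded too


class BaselineHTML(HTMLParser):
    """Re-emit only whitelisted tags and attributes; keep all visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self._skip_depth += 1
            return
        if tag not in ALLOWED:
            return  # drop the tag itself (e.g. <span>, <font>) but keep its text
        kept = [(k, v) for k, v in attrs if k in ALLOWED[tag] and v is not None]
        attr_text = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if tag in ALLOWED and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.out.append(escape(data, quote=False))


def to_baseline_html(html: str) -> str:
    parser = BaselineHTML()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    "<p class=MsoNormal style='margin-left:.5in'>"
    "<span style='font-size:12.0pt'><font face=Arial>Fees &amp; terms</font></span></p>"
    "<table border=1 cellpadding=0 width=590><tr><td width=295 valign=top>A</td></tr></table>"
)

print(to_baseline_html(sample))
# prints: <p>Fees &amp; terms</p><table border="1"><tr><td>A</td></tr></table>
```

A whitelist errs on the side of removing too much, so any tag or attribute you care about has to be added to `ALLOWED` explicitly.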

I am still not 100% satisfied with either conversion utility, especially because I really like to see clean HTML without extra coding to increase document size and thus download time. Nonetheless, both have virtues for helping to bring formatted Word documents to a level that can be incorporated on web sites in semi-clean HTML format. If any readers out there have suggestions for additional conversion utilities, please let me know, as I’d love to test them out and report on the findings.

As always, if you have questions or comments on this column, please don’t hesitate to send me an email.

Web Sites Mentioned in this column:


Delphion Intellectual Property Network

Cartesian Perceptual Compression

Technical Overview of Cartesian Perceptual Compression


The impact of electronic journals on local network computing and printing environments

CPC View

Office 2000 HTML Filter 2.0

How to use the HTML Filter in Word 2000


Tela Communications

Copyright © 2000 Roger V. Skalbeck. All Rights Reserved.
