A bit about PURLs

One of my first jobs after finishing an MLS was as a “metadata librarian” at Old Dominion University. This was back in 1998, so it was fitting that one of my first projects was cleaning up all these newfangled URLs that had started popping up in our online catalog.Many of these records flowed directly in because ODU, like many libraries around the US, participated in the Federal Depository Library Program, which meant that we imported MARC records for the GPO publications that the library selected.

Many of these MARC records included 856 fields which pointed at digital equivalents on the emerging web. Sometimes the publications were only available online, which seemed to be the case for datasets that were no longer being distributed on CD-ROM. In these times of Bandcamp getting datasets sent on physical media by the post seems like a feature not a bug right? Maybe some are still distributed this way? The idea that you could click on some text in the catalog (a link) and read the described document was a revolution in consciousness–when it worked. Unfortunately many of these URLs had broken even in the short time that they were in the catalog. They needed to be either updated or removed so that library patrons didn’t go down 404 dead ends when consulting the catalog.

That’s when I first ran across the idea of the Persistent Uniform Resource Locator or PURL. I noticed that some of the URLs were starting to all have the same hostname purl.fdlp.gov. PURLs were created by Stuart Weibel and Erik Jul at OCLC in 1995, and were picked up by the GPO in 1997 as a way to mitigate broken URLs. The FDLP and GPO are still using PURLs today. I guess PURL is the original URL shortener. But it was created not to abbreviate otherwise long and otherwise cumbersome URLs, but to make them more resilient and persistent over time. You could put a PURL into a catalog record and if the URL it pointed to needed to change you changed the redirect on the PURL server, and all the places that pointed to the PURL didn’t need to change. It was a beautifully simple idea, and has influenced other approaches like DOI and Handle. But this simplicity depends on a commitment to keeping the PURL up to date, because:

… every PID is a service commitment involving at least some sweat equity. (Kunze, 2018)

PURLs are only as good as the maintenance work that has gone into updating the underlying URLs when they inevitably change. And in the lucky cases where the underlying URL haven’t changed, all the work that has gone into managing the infrastructure behind that URL namespace in order for that URL to stay the same. Updating to remain the same, with apologies to Wendy Chun.

I was reminded of PURLs recently when I happened to notice a link to a broken PURL in some old documentation about RSS 1.0. In this case Carl fixed the PURL not by updating the PURL to point at the correct place, but by putting a copy of the document where the PURL expected it to be.

This got me into looking at how the PURL service worked. As the Wikipedia article sketches out, PURLs have had a variety of implementations over the years. From the HTML at purl.fdlp.gov it looks like they are running the Zepheira / 3 Round Stones version–it’s not clear to me where one of those began and the other ended, and if any of that software is open source or not.

In 2016 OCLC and the Internet Archive announced that the venerable purl.org service had moved to the Internet Archive. In the process the software was completely rewritten (I’ve heard it’s a Python application, but I don’t think the source code has been released). One of the interesting side effects of this rewrite is that the new PURL server uses IA’s backend storage as a kind of database–and that storage is public.

So every PURL namespace is an item in the PURL Data Collection. This collection contains a couple JSON files that describe the PURL/URL mappings contained in that namespace as well as who maintains it, and a history of its changes. So for example the PURL for the Dublin Core Terms namespace is:

https://purl.org/dc/terms

and this IA item contains the metadata for that namespace:

https://archive.org/details/purl_dc_terms/

which really is a s3 like bucket that contains two JSON files:

If you squint a bit the PURL system kind of looks like a static site, in that the data it uses is all sitting in static files in their S3-like storage. Of course it’s not really like a static site because there is a dynamic web application reading these files, doing the redirects, and no doubt caching things. Also these seemingly static files are updated when a PURL is edited. But it’s nice that the data is all in JSON files sitting out there for anyone to look at. In some ways it’s not unlike the approach taken by the w3id project, which is basically to use Git and Apache to version identifier namespaces. This is fanciful, but both approaches kind of remind me of the exposed pipes on the Centre Pompidou, where the color of each pipe represents a different function (air, electricity, water, escalators, etc).

The w3id approach is nice because you can get all the data with git clone on the command line. Doing the same thing with purl.org is possible, but takes a little bit more work because you have to use the Internet Archive API to first get all the items in the PURL Data Collection and then go and fetch the JSON files for all of those items.

I decided to give this a try over in this Jupyter notebook. Once you have the data it’s possible to ask some questions that weren’t easy to ask before.

How often are new PURL namespaces created?

A few obvious things to note here are that PURLs started in 1995, and we only see 2009 on. Also 2009 is off the charts in terms of the number of namespaces that were created that year. One can only assume that the underlying system was modified then to start recording when the namespace was created. Things look a bit more interesting if you ignore the 2009 backlog.

Here you can clearly see that more PURL namespaces are being created after the move to the Internet Archive in 2016.

Similarly it’s possible to use the history files to see how often namespaces are being updated after they were created.

This chart tells an opposite story, which is more difficult to read. The number of namespace updates seems to have dropped off drastically after the move to the Internet Archive. But the low bar in 2021 is still 1,021 updates, or like 3 namespace changes per day. I’m not entirely sure what to expect. Perhaps there were systems tied to the PURL system that got disconnected when it moved?

Finally one of the more salient questions to ask of this data is how many of the PURLs still work? This is a complex enough question for research project and not just a quick blog post. Over in the notebook I started by sampling all the target URLs (N=405,637 n=662). The idea of checking each and every URL again was more like a full science project. In the process I noticed that it was oversampling some domains quite a bit like my.yoolib.net. So I tried again, but instead of sampling all the URLs I sampled the PURL namespaces (N=21,894, n=644) and picked a random URL from each PURL namespace. This seemed to work better but still appeared to oversample, with host names like www.olemiss.edu showing up quite a bit. It looks like they might create a new PURL namespace for every finding aid they put up?

Of course, testing whether a URL still works is surprisingly tricky business: the response could be 200 OK but say Not Found, or it could be a totally different page (content drift) (Zittrain, Bowers, & Stanton, 2021). A server could redirect permanently or temporarily to other pages for lots of reasons. Sites can be temporarily offline, etc. But this didn’t stop me from writing a pretty simplistic checker (recalling my days writing Perl at ODU in the 90s) that allowed me to tally the results for the sample:

One thing to note here is that requests (the Python HTTP library I used) automatically followed redirects, so that’s why you don’t see any 3XX status codes here. The ??? results are when a TCP/IP error occurred, which (from just scanning the results in the notebook) look to largely be the result of a DNS failure when the host name no longer can be looked up (DNS).

Perhaps it’s more interesting to look at the status codes over namespace creation time. Here it’s important to ignore 2009 since so many were retrospectively assigned to that bucket. This chart is interactive to make it a bit easier to read.

Since the Internet Archive also run the Wayback Machine it might make sense for their PURL service to redirect to archived content when the resource is no longer available? I noticed that the PURL. I noticed that the FDLP PURL service presents a history page for PURLs that have gone dead, or are no longer “traceable”. For example take a look at https://purl.fdlp.gov/GPO/LPS57016. It’s almost like a list of “last known addresses”. Both really might be good places to link to the Wayback Machine when a snapshot exists?

I guess I’ll stop here for now. Maybe this little post will whet someone else’s appetite to looking closer at PURL as a web infrastructure. I think it also is a great example of how exposed pipes are useful when building applications that are meant to be infrastructure. I started out wanting to describe how these PURLs are showing up as data but it ended up going in a few directions at once.

One little bit of PURL administrivia that I’ll leave you with that since the PURLs are all part of a collection at the Internet Archive they also have an RSS feed you can use to watch as new namespaces are created:

https://archive.org/services/collection-rss.php?collection=purl_collection

And if you are interested in looking at the full and sample datasets, but don’t want to run the notebook yourself, they are here, with the caveat that they are just a snapshot in time:

References

Kunze, J. (2018). Ten persistent myths about persistent identifiers. University of California. Retrieved from https://escholarship.org/uc/item/73m910w8

Zittrain, J. L., Bowers, J., & Stanton, C. (2021). The Paper of Record Meets an Ephemeral Web: An Examination of Linkrot and Content Drift within The New York Times (SSRN Scholarly Paper No. ID 3833133). Rochester, NY: Social Science Research Network. https://doi.org/10.2139/ssrn.3833133

This paints a better picture of the resolvability of the PURLs. One thing I noticed while watching the sampled URLs get checked is that a fair number in the “unknown” category appeared to be the result of SSL certificate errors. So to do this properly it would be good to dive into what’s going on in that category. It might also be good to do stratified sampling by year, so the years can be more easily compared?