Reading old news in the web archive, distantly

[The substance of this post has now been published.]

One of the defining moments of Rowan Williams’ time as archbishop of Canterbury was the public reaction to his lecture in February 2008 on the interaction between English family law and Islamic shari’a law. As well as focussing attention on real and persistent issues of the interaction of secular law and religious practice, it also prompted much comment on the place of the Church of England in public life, the role of the archbishop, and on Williams personally. I tried to record a sample of the discussion in an earlier post.

Of course, a great deal of the media firestorm happened online. I want to take the episode as an example of the types of analysis that the systematic archiving of the web now makes possible: a new kind of what Franco Moretti called ‘distant reading.’

The British Library holds a copy of the Internet Archive’s holdings for the .uk top-level domain for the period 1996-2010. One of the secondary datasets that the Library has made available is the Host Link Graph. With this data, it’s possible to begin examining how different parts of the UK web space referred to others. Which hosts linked to which others, and from when until when?

This graph shows the total number of unique hosts that were found linking at least once to archbishopofcanterbury.org in each year.

[Figure: bar chart of the number of unique hosts linking to archbishopofcanterbury.org, by year]

My hypothesis was that there should be more unique hosts linking to the archbishop’s site after February 2008, which is by and large borne out. The figure for 2008 is nearly 50% higher than for the previous year, and nearly 25% higher than the previous peak in 2004. This would suggest that a significant number of hosts that had not previously linked to the Canterbury site did so in 2008, quite possibly in reaction to the shari’a story.

What I had not expected to see was the total number fall back to trend in 2009 and 2010. I had rather expected to see the absolute numbers rise in 2008 and then stay at similar levels – that is, to see the links persist. The drop suggests either that large numbers of sites were revised to remove links thought to be ‘ephemeral’ (that is to say, the links were actively removed), or that there is a more general effect, in that certain types of “news” content are not (in web archivist terms) self-archiving. [Update 02/07/2014 – see comment below]

The next step is for me to look in detail at those domains that linked only once to Canterbury, in 2008, and to examine these questions in a more qualitative way. Here then is distant reading leading to close reading.

Method
You can download the data, which is in the public domain, from here. Be sure to have plenty of hard disk space, as when unzipped the data is more than 120GB. The data looks like this:

2010 | churchtimes.co.uk | archbishopofcanterbury.org | 20

which tells you that in 2010, the Internet Archive captured 20 individual resources (usually, although not always, “pages”) in the Church Times site that linked to the archbishop’s site. My poor old laptop spent a whole night running through the dataset and extracting all the instances of the string “archbishopofcanterbury.org”.
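A minimal sketch of this kind of extraction, in Python, might look like the following. It is an illustration rather than the script that was actually run, and it assumes that every line has exactly the four pipe-separated fields shown in the sample above. The names `LinkRecord`, `parse_line` and `matching_records`, and the file name in the usage comment, are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple

TARGET = "archbishopofcanterbury.org"


class LinkRecord(NamedTuple):
    year: int
    source: str   # host on which the links were found
    target: str   # host being linked to
    count: int    # number of captured resources containing such a link


def parse_line(line: str) -> LinkRecord:
    # Assumes the "year | source | target | count" layout shown in the sample line.
    year, source, target, count = (field.strip() for field in line.split("|"))
    return LinkRecord(int(year), source, target, int(count))


def matching_records(lines: Iterable[str], needle: str = TARGET) -> Iterator[LinkRecord]:
    # Stream line by line so that the 120GB file never has to fit in memory.
    for line in lines:
        if needle in line:  # cheap substring test before parsing
            yield parse_line(line)


# Usage (file name is hypothetical):
# with open("host-link-graph.txt", encoding="utf-8") as handle:
#     for record in matching_records(handle):
#         print(record)
```

Because the test is a plain substring match, it also picks up lines in which the archbishop’s site is the source rather than the target; those need to be filtered out afterwards.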

Then I looked at the total numbers of unique hosts linking to the archbishop’s site in each year. In order to do so, I:

(i) stripped out those results which were outward links from a small number of captures of the archbishop’s site itself.

(ii) allowed for the occasions when the IA had captured the same host twice in a single year (which does not occur consistently from year to year.)

(iii) did not aggregate results for hosts that were part of a larger domain. This would have been easy to do in the case of the larger media organisations such as the Guardian, which has multiple hosts (society.guardian.co.uk, education.guardian.co.uk, etc.). However, it is much harder to do reliably for all such cases without examining individual archived instances, which was not possible at this scale.
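The three steps above can be sketched in Python roughly as follows. This is one possible reading of them, not the original procedure: it assumes the records are already available as (year, source host, target host, count) tuples, it treats any source host containing the archbishop’s domain as the site itself, and the function name `unique_linking_hosts` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

TARGET = "archbishopofcanterbury.org"

Record = Tuple[int, str, str, int]  # (year, source host, target host, count)


def unique_linking_hosts(records: Iterable[Record], target: str = TARGET) -> Dict[int, int]:
    hosts_by_year: Dict[int, Set[str]] = defaultdict(set)
    for year, source, dest, _count in records:
        if target not in dest:
            continue  # not a link to the archbishop's site
        if target in source:
            continue  # step (i): outward links from the site itself
        # Step (ii): a set absorbs repeated rows for the same host and year.
        # Step (iii): each host string is kept separate; subdomains are not merged.
        hosts_by_year[year].add(source)
    return {year: len(hosts) for year, hosts in sorted(hosts_by_year.items())}
```

The capture count in the fourth field is deliberately ignored here, since the question is how many distinct hosts linked at least once, not how often they did so.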

Assumptions

(i) that a host “abc.co.uk” held the same content as “www.abc.co.uk”.

(ii) that the Internet Archive were no more likely to miss hosts that linked to the Canterbury site than ones that did not – i.e., if there are gaps in what the Internet Archive found, there is no reason to suppose that they systematically skew this particular analysis.
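Assumption (i) amounts to a small normalisation applied to each source host before counting. A hedged sketch in Python is given below; the function name `normalise_host` is hypothetical, and the lower-casing and whitespace trimming are additional assumptions not stated above.

```python
def normalise_host(host: str) -> str:
    # Trim and lower-case first (an added assumption: host names are case-insensitive).
    host = host.strip().lower()
    # Treat "www.abc.co.uk" and "abc.co.uk" as the same host.
    if host.startswith("www."):
        host = host[len("www."):]
    return host
```

Only a leading “www.” is removed; other subdomains are left untouched, in keeping with the decision not to aggregate hosts within larger domains.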

Religion, politics and law in contemporary Britain: a web archive

[This is an expanded version of a post first published in the UK Web Archive blog.]

It has been over two years in the making, but I am delighted to be able to say that my own special collection in the UK Web Archive is now online.

UKWA (for which I am engagement and liaison lead, based at the British Library) collects and preserves websites of scholarly and cultural importance from the UK web domain. Already UKWA collects some 11,000 sites, and has more than 50,000 instances in total, with series of snapshots of some sites going back the best part of a decade. That’s a lot of data, and so one of the ways into the archive is by means of the special collection, of sites on a particular theme.

A couple of years ago, long before coming to the BL, I joined a project at the Library which brought together a group of scholars to guest-curate special collections on our research interests. I had become interested in the sharpening of the terms of debate about the place of religion in British public life, particularly since 9/11 and the London bombings in 2005. I’ve long been interested in public debate about church and state; but until relatively recently this happened by means of the print press, public oratory, ephemeral publication and the broadcast media. It struck me that a good deal of this debate had already moved online, and so new ways of capturing and preserving it were going to be needed. And so, the ‘politics of religion collection’ (as it was then known) was born. (See these posts on my progress.)

I fairly soon realised why I’m not an archivist, since all sorts of unfamiliar questions hove into view. When archiving the web, what is the base unit? A whole domain, such as www.bbc.co.uk? Or a single URL? Several sites, like those of the National Secular Society or the Christian Institute, were central to my concerns, and so could be included whole. But what does one do with a single post on a PR blog about the handling of the sharia law row by Rowan Williams and his staff? In fact, the collection is a mixture of whole domains and individual directories or pages from larger sites; an uneasy compromise, but a necessary one.

Also (and I may as well come straight out with it), the collection is selective, and thus in a real sense subjective. As a watcher of contemporary religious politics, against the backdrop of recent history, my impression is that the place of religious ideas, symbols and organisations in public life is at its most contested for decades. Historians are traditionally wary of assessing the significance of present trends, since it leaves hostages to fortune and later events. Yet, all archival choices from a pool of material not defined in advance by provenance involve some judgements as to significance; and historians are as well suited as any to make those judgements. And so I have put the collection together now to enable future historians to begin to answer the questions which I anticipate will be significant. (See an older post on why I think historians should engage with this way of working.)

There were other issues. Were I the archivist for a particular organisation, I’d have no problem with getting permission to add material to my archive: everything produced in-house would be in view. The problem for web archiving is that we’re dealing with other people’s copyright work, and so an individual permission is needed for each site. I have a long list of sites which I would dearly love to add to the collection, but for which (for various reasons) we’ve had no response. So, if you are the owner of Protest the Pope, or Holy Redundant, or Christians in Politics, please get in touch. For now, even if the collection cannot be anything like comprehensive, I do hope that it is at least coherent.

There are particular strengths, and some gaps. It includes many campaigning organisations, both secularist and religious, and is heavy on the conservative Christian groups about which I myself know most. It is very light on non-Christian faiths, since I know the field much less well. It is still very much open, however, and so suggestions of sites that ought to be included are very welcome, via this blog or at the UKWA Nominate a Site page.

What can you do with it? For now, there is a simple browse function; and the collection can be searched on its own. And over time, all sorts of uses will present themselves, which we can’t currently imagine. But the data is there: a growing longitudinal series of timed instances of websites, identified as thematically related; that is to say, an archive.