New article: On digital contemporary history

A little article of mine has just appeared in the Danish historical journal Temp, based on a lecture given in Copenhagen to the Danish Association for Research in Contemporary History in January 2016.

It suggests that there has been a relative lack of digitally enabled historical research on the recent past, when compared to earlier periods of history. It explores why this might be the case, focussing in particular on the obstacles to, and some missing drivers of, the mass digitisation of primary sources for the 20th century. It suggests that the situation is likely to change, and relatively soon, as a result of the increasing availability of sources that were born digital, and of Web archives in particular. The article ends with some reflections on several shifts in method and approach which that changed situation is likely to entail.

By the kind permission of the editor, I make it available here.

Title: Digital contemporary history: sources, tools, methods, issues
Details: Temp: Tidsskrift for historie, 14 (2017), 30-38.
Download the PDF

Why hoping private companies will just do the Right Thing doesn’t work

In the last few weeks I’ve been to several conferences on the issue of the preservation of online content for research, and in particular social media. This is an issue that is attracting a lot of attention at the moment: for example, see Helen Hockx-Yu’s paper for last year’s IFLA conference, or the forthcoming TechWatch report from the Digital Preservation Coalition. As I myself blogged a little while ago, and (obliquely) suggested in this presentation on religion and social media, there’s growing interest from social scientists in using social media data – most typically Twitter or Facebook – to understand contemporary social phenomena. But whereas users of the archived web (such as myself) can rely on continued access to the data we use, and can expect to be able to point to that data such that others may follow and replicate our results, this isn’t the case with social media.

Commercial providers of social media platforms impose several different kinds of barriers. These can include: limits on the amount of data that may be requested in any one period of time; provision of samples of data created by proprietary algorithms which may not themselves be scrutinised; and limits on how much of a dataset, and/or which fields within it, may be shared with other researchers. These issues are well known, and aren’t my main concern here. My concern is with how these restrictions are being discussed by scholars, librarians and archivists.

I’ve noticed an inability to imagine why it is that these restrictions are made, and as a result, a struggle to begin to think what the solutions might be. There has been a similar trend amongst the Open Access community: to paint commercial academic publishers as profit-hungry dinosaurs, making money without regard to the public-good element of scholarly publishing. Regarding social media, it is viewed as simply a failure of good manners when a social media firm shuts down a service without providing for scholarly access to its archive, or does not allow scholars free access to and reuse of its data. Why (the question is implicitly posed) don’t these organisations do the Right Thing? Surely everyone thinks that preserving this stuff is worthwhile, and that it is a duty of all providers?

But private corporations aren’t individuals, endowed with an idea of duty and a moral sense. Private corporations are legal abstractions: machines designed for the maximisation of return on capital. If they don’t do the Right Thing, it isn’t because the people who run them are bad people. No; it’s because the thing we want them to do (or not do) impacts adversely on revenue, or adds extra cost without corresponding additional revenue.

Fundamentally, a commercial organisation is likely to shut down an unprofitable service without regard to the archive unless (i) providing access to the archive is likely to yield research findings which will help future service development; or (ii) it causes positive harm to the brand to shut it down (or helps the brand to be seen *not* to do so). Similarly, they are unlikely to incur costs to run additional services for researchers, or to share valuable data, unless (again) they stand to gain something from the research, however obliquely, or by doing so they either help or protect the brand.

At this point, readers may despair of getting anywhere in this regard, which I could understand. One way through this might be an enlargement of the scope of legal deposit legislation such that some categories of data (politicians’ tweets, say, given the recent episode over Politwoops) are deemed sufficiently significant to be treated as public records. There will surely be lobbying against it, but once such law is passed, companies will adapt their business models to the changed circumstances, as they always have done. An even harder task is to shift the terms of public discourse so that a publicly accessible record of this data is seen by the public as necessary. Another way is to build communities of researchers around particular services, such that generalisable research about a service can be absorbed by the providers, thus showing that openness with the data leads to a gain in terms of research and development.

All of these are in their ways Herculean tasks, and I have no blueprint for them. But recognising the commercial realities of the situation would get us further than vague pieties about persuading private firms to do the Right Thing. It isn’t how they work.

Reading old news in the web archive, distantly

One of the defining moments of Rowan Williams’ time as archbishop of Canterbury was the public reaction to his lecture in February 2008 on the interaction between English family law and Islamic shari’a law. As well as focussing attention on real and persistent issues of the interaction of secular law and religious practice, it also prompted much comment on the place of the Church of England in public life, the role of the archbishop, and on Williams personally. I tried to record a sample of the discussion in an earlier post.

Of course, a great deal of the media firestorm happened online. I want to take the episode as an example of the types of analysis that the systematic archiving of the web now makes possible: a new kind of what Franco Moretti called ‘distant reading.’

The British Library holds a copy of the holdings of the Internet Archive for the .uk top level domain for the period 1996-2010. One of the secondary datasets that the Library has made available is the Host Link Graph. With this data, it’s possible to begin examining how different parts of the UK web space referred to others. Which hosts linked to others, and from when until when?

This graph shows the total number of unique hosts that were found linking at least once to the archbishop’s site in each year.

Canterbury unique linking hosts - bar

My hypothesis was that there should be more unique hosts linking to the archbishop’s site after February 2008, which is by and large borne out. The figure for 2008 is nearly 50% higher than for the previous year, and nearly 25% higher than the previous peak in 2004. This would suggest that a significant number of hosts that had not previously linked to the Canterbury site did so in 2008, quite possibly in reaction to the shari’a story.

What I had not expected to see was the total number fall back to trend in 2009 and 2010. I had rather expected to see the absolute numbers rise in 2008 and then stay at similar levels – that is, to see the links persist. The drop suggests that either large numbers of sites were revised to remove links that were thought to be ‘ephemeral’ (that is to say, actively removed), or that there is a more general effect in that certain types of “news” content are not (in web archivist terms) self-archiving. [Update 02/07/2014 – see comment below]

The next step is for me to look in detail at those domains that linked only once to Canterbury, in 2008, and to examine these questions in a more qualitative way. Here then is distant reading leading to close reading.

You can download the data, which is in the public domain, from here. Be sure to have plenty of hard disk space: unzipped, the data is more than 120GB. The data looks like this:

2010 | | | 20

which tells you that in 2010, the Internet Archive captured 20 individual resources (usually, although not always, “pages”) in the Church Times site that linked to the archbishop’s site. My poor old laptop spent a whole night running through the dataset and extracting all the instances of the string “”.
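A scan of this kind can be sketched in a few lines. This is illustrative only: the target string in the post is elided, so `TARGET` below is a stand-in, and the file name is hypothetical. Reading line by line keeps memory use flat even on a 120GB file:

```python
# Sketch of an overnight pass over the (unzipped) Host Link Graph.
# TARGET stands in for the elided host string in the post; the real
# value is not reproduced there, so this name is purely illustrative.
TARGET = "example-target-host"

def matching_lines(path, target):
    """Yield each line of the link-graph file that mentions the target host."""
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if target in line:
                yield line.rstrip("\n")

# Hypothetical usage:
# for row in matching_lines("host-link-graph.tsv", TARGET):
#     print(row)
```

A generator avoids loading the whole file, which is why even a "poor old laptop" can grind through it overnight.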

Then I looked at the total numbers of unique hosts linking to the archbishop’s site in each year. In order to do so, I:

(i) stripped out those results which were outward links from a small number of captures of the archbishop’s site itself.

(ii) allowed for the occasions when the IA had captured the same host twice in a single year (which does not occur consistently from year to year.)

(iii) did not aggregate results for hosts that were part of a larger domain. This would have been easy to spot in the case of the larger media organisations such as the Guardian, which has multiple hosts (society, etc.) However, it is much harder to do reliably for all such cases without examining individual archived instances, which was not possible at this scale.

In doing so, I assumed:
(i) that a host “” held the same content as “”.

(ii) that the Internet Archive was no more likely to miss hosts that linked to the Canterbury site than ones that did not – i.e., if there are gaps in what the Internet Archive found, there is no reason to suppose that they systematically skew this particular analysis.
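The per-year count described above can be sketched as follows. The field layout ("year | source host | target host | count") and all host names here are assumptions for illustration, since the dataset's host fields are elided in the post; step (iii) is deliberately not applied, matching the approach described:

```python
from collections import defaultdict

# Stand-in for the Canterbury host; the real host string is elided above.
ARCHBISHOP_HOST = "archbishop-host"

def unique_linking_hosts(rows, target=ARCHBISHOP_HOST):
    """Count unique hosts linking to `target` per year, from rows of the
    assumed form "year | source_host | target_host | count"."""
    hosts_by_year = defaultdict(set)
    for row in rows:
        year, source, dest, _count = [field.strip() for field in row.split("|")]
        # Step (i): strip outward links from the target site itself.
        if source == target:
            continue
        if dest == target:
            # Step (ii): a set de-duplicates hosts captured more than
            # once in the same year.
            hosts_by_year[year].add(source)
    # Step (iii) is not applied: hosts within one larger domain are
    # still counted separately, as in the post.
    return {year: len(hosts) for year, hosts in sorted(hosts_by_year.items())}
```

Using a set per year makes step (ii) automatic, so no separate de-duplication pass is needed.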