Web Archiving 101: a new course and podcast

It was a great pleasure to work with former colleagues at the University of London Computing Centre this week to deliver Web Archiving 101, a day course which forms part of ULCC’s highly successful Digital Preservation Training Programme. My thanks to Steph Taylor and Ed Pinsent, my fellow trainers, and also Sara Day Thomson from the Digital Preservation Coalition who taught the module on social media archiving.

To get a taste of the day, see the programme. A few days before, Ed, Steph and I also had a very enjoyable talk about the issues the course would raise: it’s available as a podcast on SoundCloud.

Understanding the web of faith: forthcoming book chapter

I’m very pleased to say that an essay of mine has been accepted for a forthcoming volume: The Web as History: the first two decades. It is edited by Niels Brügger and Ralph Schroeder, and will appear Open Access with UCL Press in 2016.

Here’s my abstract:

‘Much of the discourse that historians of contemporary religion until recently tracked in correspondence, periodical publication and print ephemera has migrated online. But the task of understanding religious discourse in the UK web space has hardly begun. The task is hard to undertake at the highest level since there are no second-level domains that serve as useful units of analysis — there is no faith.uk to match nhs.uk or ac.uk.

‘This chapter represents a first step towards understanding the evolution of the UK religious web space, by means of two interrelated case studies, which between them point to the agenda and content of a larger research project. Both studies utilise the JISC UK Web Domain Dataset for the period 1996-2008, as held by the British Library.

‘Firstly, it will examine the web archive footprint left by the public controversy in 2008 over the comments made by Rowan Williams, archbishop of Canterbury, on the matter of sharia law. Using both the link graph and a direct qualitative analysis of archived content, it will explore both the shape and the content of the controversy and show the degree to which religious debate had not only migrated from print to the web, but in doing so had engaged different actors and lost others, and changed in its tone.

‘Secondly, it will consider the growing tension in religious discourse between faith groups and organisations with a secularist agenda. Again, using the link graph and some qualitative analysis, it will explore the patterns in which linkages grew and shifted between the web estates of key but opposed organisations in relation to issues including faith schools and creationism, the reform of the law on blasphemy, and the place of the bishops in the House of Lords.

Will historians of the future be able to study Twitter?

Over the last year or so, the IHR Digital History seminar has become increasingly web-focussed, which is of course of interest to me (if not necessarily to everyone). Last week we had an excellent paper from Jack Grieve of Aston University on the tracking of newly emerging words as they appeared in large corpora of tweets from the UK and the US. By amassing very large tweet datasets, he and his colleagues are able to observe the early traces of newly emerging words, and also (when those tweets were submitted from devices which attach geo-references) to see where those new words first appear, and how they spread. Jack and his colleagues are finding that words quite often emerge first (in the US) in the east and south-east (or California) and then spread towards the centre of the continent. They don’t necessarily spread in even waves across space, or even spring between urban centres and then to rural areas (as would have been my uneducated guess). Read more at the project site, treets.net, or watch the paper.

This kind of approach is quite impossible without the kind of very large-scale natural language data such as social media afford. This is particularly so as most words are (perhaps counter-intuitively) rather rare. In the corpus in question, the majority of the 67,000 most common words appear only once in 25 million words. Given this, datasets of billions of tweets are the minimum size necessary to be able to see the patterns.

It was interesting to me as a convenor to see the rather different spread of people who came to this paper, as opposed to the more usual digital history work the seminar showcases. Jack focussed on tweets posted since 2013, a time span that even the most contemporary historian would struggle to call their own, and so not so many historians came along – but we had perhaps our first mathematician instead. This was a shame, as Jack’s paper was a fascinating glimpse into the way that historical linguistics, and indeed other types of historical enquiry, might look in a couple of decades’ time.

But there is a caveat to this, which was beyond the scope of Jack’s paper, to do with the means by which this data will be accessible to scholars of 2014 working in (say) 2044. Jack and his colleagues work directly from the so-called Twitter “firehose”; they harvest every tweet coming from the Twitter API, and (on their own hardware) process each tweet and discard those that are not geo-coded to within the study area. This kind of work involves considerable local computing firepower, and (more importantly) is concerned with the now. It creates data in real time to answer questions of the very recent past.
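To make the filtering step concrete, here is a minimal sketch in Python of discarding tweets that are not geo-coded to within a study area. It is not the project's own code. It assumes each tweet arrives as a dictionary with a GeoJSON-style `coordinates` field (longitude first, then latitude), and the bounding box is a rough, illustrative rectangle around the UK, chosen by me for the example.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

# Rough, illustrative bounding box (west, south, east, north) in degrees.
# This is an assumption for the example, not the study area the project used.
STUDY_AREA: Tuple[float, float, float, float] = (-8.7, 49.8, 1.8, 60.9)


def in_study_area(tweet: Dict[str, Any],
                  box: Tuple[float, float, float, float] = STUDY_AREA) -> bool:
    """Return True only if the tweet carries a point location inside the box."""
    geo = tweet.get("coordinates")
    if not geo or geo.get("type") != "Point":
        return False  # not geo-coded at all, so discard
    lon, lat = geo["coordinates"]  # GeoJSON order: longitude, then latitude
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def keep_geocoded(tweets: Iterable[Dict[str, Any]],
                  box: Tuple[float, float, float, float] = STUDY_AREA
                  ) -> Iterator[Dict[str, Any]]:
    """Lazily yield only the tweets located inside the study area."""
    return (t for t in tweets if in_study_area(t, box))


# Hypothetical sample records standing in for a live stream.
sample = [
    {"text": "from London",
     "coordinates": {"type": "Point", "coordinates": [-0.1276, 51.5072]}},
    {"text": "from New York",
     "coordinates": {"type": "Point", "coordinates": [-74.006, 40.7128]}},
    {"text": "no location", "coordinates": None},
]

kept = [t["text"] for t in keep_geocoded(sample)]
print(kept)
```

Because the filter works one tweet at a time, nothing that falls outside the box is retained, which is exactly why a dataset built this way answers only the questions it was designed for.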

Researchers working in 2044 and interested in 2014 may well be able to re-use this particular bespoke dataset (assuming it is preserved – a different matter of research data management, for another post sometime). However, they may equally well want to ask completely different questions, and so need data prepared in a quite different way. Right now, the future of the vast ocean of past tweets is not certain; and so it is not clear whether the scholar of 2044 will be able to create their own bespoke subset of data from the archive. The Library of Congress, to be sure, are receiving an archive of data from Twitter; but the arrangements for access to this data are not clear, and (at present) no access is available at all. So, in the same way that historians need to take some ownership of the future of the archived web, we need to become much more concerned about the future of social media: the primary sources that our graduate students, and their graduate students in turn, will need to work with two generations down the line.

Certainly, historians have always been used to working around and across the gaps in the historical record; it’s part of the basic skillset, to deal with the fragmentary survival of the record. But there is right now a moment in which major strategic decisions are to be made about that survival, and historians need to make themselves heard.

[This post also appears on the IHR Digital History Seminar blog.]

Religion, social media and the web archive

Late last year I was delighted to be invited to be one of four keynote speakers at a workshop on religion and social media at the International AAAI Conference on Web and Social Media in Oxford in May. Here are some initial thoughts on what I intend to say.

There has been an interesting upswing recently in scholarly interest in the ways in which religious people, and the organisations in which they gather together, represent themselves and communicate with others on social media. However, this work has been conducted relatively independently from the emerging body of scholarship on the archived web.

There are some reasons for this. First is the fact that much of the scholarship on social media tends to be focussed very firmly on the present. As such, data tends to be gathered directly from social media platforms “to order”, to match the particular research questions in view, and does not engage the various web archives that are in existence, whether at national libraries or the Internet Archive.

The second reason (which may indeed be the more important) is that traditional web archiving has limited success in archiving social media content. There are several well-documented reasons for this, not least the significant technical difficulties in capturing the content as it is presented in user interfaces such as that for Twitter or YouTube. Also, the data gathered is wrapped up in its presentation layer, rather than being neatly organised as a dataset for analysis. Aside from these technical challenges, the very social nature of social media – with multiple content creators co-existing and interacting on the same platform – adds considerable complexity to the task of the web archivist of determining which content can be archived under existing legal deposit frameworks.

So much for the reasons; but this gap between social media research and the archived web needs to be closed, because part of the story is missed. If we want to understand the evolution of the engagement of churches with social media, then we need to understand the ways in which traditional church websites integrated social media content within themselves, and from what point in time. As well as this, we need to be able to understand the content to which social media users were referring and linking – content which will increasingly often be found only in web archives as it disappears from the live web.

In Oxford, I shall be presenting some small case studies in the development of the web and social media presence of local churches, individuals and national church bodies in England and in Ireland. How quickly did churches begin to integrate their social media channels with their websites – which is to ask, at which point did social media become central to their communication strategies? This work is enabled by data made available from the British Library which covers the period from 1996 until 2013; the period in which social media grew from nothing to the prominence it now holds.

Method in the web archive for the arts and humanities: a conference report

[In early December 2014 the Big UK Domain Data for the Arts and Humanities project held an excellent day conference on the theme of web archives as big data. A good part of the day was taken up with short presentations from the project’s bursary holders, arts and humanities scholars all, reflecting on their substantive research findings, on the experience of using the prototype user interface (developed by the BL), and on web archives as source material in general.
In early 2015 these results will appear on the BUDDAH project blog as a series of reports. This post reflects on some common methodological themes that emerged during the course of the day. A version of this was also posted on the project blog. Details of the projects are to be found also on the BUDDAH blog.]

Perhaps the single most prominent note of the whole day was of the sheer size of the archive. “Too much data!” was a common cry heard during the project, and with good reason, since there are few other archives in common use with data of this magnitude, at least amongst those used by humanists. In an archive with more than 2 billion resources recorded in the index, the researchers found that queries needed to be a great deal more specific than most users are accustomed to; and that even the slightest ambiguity in the choice of search terms in particular led very quickly to result sets containing many thousands of results. Gareth Millward (@MillieQED) also drew attention to the difficulties in interpreting patterns over time in the incidence of any but the most specific search terms across the whole dataset, since almost any search term a user can imagine may have more than one meaning in an archive of the whole UK web.

One common strategy to come to terms with the size of the archive was to “think small”: to explore some very big data by means of a series of small case studies, which could then be articulated together. Harry Raffal, for example, focussed on a succession of captures of a small set of key pages in the Ministry of Defence’s web estate; Helen Taylor on a close reading of the evolution of the content and structure of certain key poetry sites as they changed over time. This approach had much in common with that of Saskia Huc-Hepher on the habitus of the London French community as reflected in a number of key blogs. Rowan Aust also read important things from the presence and absence of content in the BBC’s web estate in the wake of the Jimmy Savile scandal.

An encouraging aspect of the presentations was the methodological holism on display, with this particular dataset being used in conjunction with other web archives, notably the Internet Archive. In the case of Marta Musso’s work on the evolution of the corporate web space, this data was but one part of a broader enquiry employing questionnaire and other evidence in order to create a rounded picture.

One particular and key difference between this prototype interface and other familiar services is that search results in the UI are not prioritised by any algorithmic intervention, but are presented in the archival order. This brought into focus one of the recurrent questions in the project: in the context of superabundant data, how attached is the typical user to a search service that (as it were) second-guesses what it was that the user *really* wanted to ask, and presents results in that order? If such a service is what is required, then how transparent must the operation of the algorithm be in order to be trusted? Richard Deswarte (@CanadianRichard) powerfully drew attention to how fundamental has been the effect of Google on user expectations of the interfaces they use.

Somewhat surprisingly (at least for me), more than one of the speakers was prepared to accept results without such machine prioritisation: indeed, in some senses it was preferable to be able to utilise what Saskia Huc-Hepher described as the “objective power of arbitrariness”. If a query produced more results than could be inspected individually, then both Saskia and Rona Cran were more comfortable with making their own decisions about taking smaller samples from those results than relying on a closed algorithm to make that selection. In a manner strikingly akin to the functionality of the physical library, such arbitrariness also led on occasion to a creative serendipitous juxtaposition of resources: a kind of collage in the web archive.

Understanding the shape of the Anglo-Irish web: a pilot project

I’m delighted to be able to say that I shall be a Visiting Research Fellow at the Moore Institute of the National University of Ireland at Galway in 2015. Here are some details of what I plan to get up to.

The task of understanding what constitutes the nation in the web archive is only in its infancy. Web archivists in national libraries have long known that top-level domains such as .uk or .ie do not encompass all the content that should be considered British or Irish for the purposes of analysis. But even the task of understanding the shape of those top-level domains has only just begun. My project begins that process for the Irish web.

One of the live questions about the nature of the national web is the degree to which it interacts with other national domains. This is of particular interest in the Irish context, since many institutions on the island of Ireland interact in cyberspace in ways that do not respect the physical and political border between Northern Ireland and the Republic.

This pilot study will begin to examine this interaction by the triangulation of analyses of data available from the Internet Archive and from the British Library. In particular, the data from the British Library lists all of the outbound links in the .uk webspace for the period 1996-2010 (see this earlier post). Such a dataset does not exist for the Irish webspace, but by analysing the composition of links from .uk sites to those in the .ie domain, it will be possible to read the growth and composition of the Irish webspace in its reflection in the UK. It will also shed valuable and hitherto unseen light on one aspect of the relation between the UK and the Republic of Ireland.
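As an indication of the kind of analysis involved, the sketch below (in Python) totals the links from each year that point at hosts in the .ie domain. It is only a sketch under stated assumptions: that each line of the data holds a year, a source host, a target host and a count separated by `|` characters, and that a target belongs to the Irish webspace if its hostname ends in `.ie`. The function name and the sample hosts are hypothetical.

```python
from collections import Counter
from typing import Iterable


def ie_links_by_year(lines: Iterable[str]) -> Counter:
    """Sum, per year, the counts of links whose target host ends in '.ie'.

    Assumes pipe-delimited lines: year | source host | target host | count.
    Lines that do not fit that shape are skipped rather than guessed at.
    """
    totals: Counter = Counter()
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue
        year, _source, target, count = parts
        try:
            year_n, count_n = int(year), int(count)
        except ValueError:
            continue  # malformed year or count
        if target.lower().endswith(".ie"):
            totals[year_n] += count_n
    return totals


# Hypothetical sample lines, using reserved 'example' names.
sample = [
    "1998 | shop.example.co.uk | partner.example.ie | 4",
    "1998 | news.example.co.uk | library.example.ie | 1",
    "2004 | shop.example.co.uk | partner.example.ie | 10",
    "2004 | shop.example.co.uk | other.example.com | 7",
    "not a data line",
]

totals = ie_links_by_year(sample)
for year in sorted(totals):
    print(year, totals[year])
```

Reading the file line by line in this way avoids loading the whole dataset into memory, which matters for data of this size.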

The initial outputs will be a series of small case studies, documented on this blog. Over time, these will be synthesised into an appropriate article or articles. I also plan to make subsets of the data available for reuse by other scholars.

The Big UK Domain Data for the Arts and Humanities project has shown an appetite amongst humanities and social sciences scholars to understand the content of web archives, and also to understand the methodological implications of working with what amounts to a new class of primary source. I intend to use the period of the Visiting Fellowship to engage with scholars across the humanities and social sciences at NUI Galway and in other Irish universities, with a view to sowing the seeds of a community of scholars interested in exploring the archive of the Irish webspace.

Reading creationism in the web archive

In recent years, anti-evolutionist thinking has attracted some attention in the news, mostly because of the role of some Christian free schools in teaching anti-evolutionist ideas alongside or in place of evolution. Anti-evolutionist ideas are however by no means new, and have been a durable minority view in some of the churches, picking up speed from the 1960s onwards. (Although the term ‘creationism’ is colloquially used to cover all the particular variants of this thinking, I use the more general term ‘anti-evolutionist’ here.)

It is not always easy to gauge the strength of the movement, but the archived UK web allows a new angle of view on the question. In theory, the web allows minority views to flourish in proportion with their intrinsic attractiveness and plausibility, no longer constrained by the high barriers to entry to traditional publishing. And in the absence of publicly available web usage statistics for the main sites, it is possible to analyse the structure of links to these sites as a proxy measure of attention (both positive and negative).

Using the Host Link Graph dataset, available from the British Library, I extracted all the unique hosts that had been found linking to any one of four prominent anti-evolutionist sites at any point between 1996 and 2010. Then, using both the live web and the Internet Archive’s interface at http://archive.org, I examined each host in order to categorise it, which I was able to do for 91% of the results. One immediate point to note is precisely how many “false” results there are. A large proportion of the hosts (34%) are categorised as Other, most of which were links associated with search engine and other directory-type sites, rather than from any host representing an autonomous actor in the field. Excluding these as well, the analysis of the remainder is as follows:


Of the remainder, 39% are the sites of individual congregations. A full analysis of these sites (39 in total) is yet to be done, but the majority are independent evangelical churches, with a handful of Baptist churches. They include very few indeed from Anglican, Roman Catholic or Methodist congregations. Given that at the time of writing the Evangelical Alliance has a membership of 3,500 individual congregations, the magnitude of these numbers suggests that anti-evolutionism is a minority view even amongst evangelical churches.

As might be expected, a significant proportion (17%) are other anti-evolutionist sites; a later post will explore the nature of this particular network. Interestingly, few inbound links are from secularist organisations, other than the British Centre for Science Education which exists to document (and counter) creationist ideas. Once data is available for the period after 2010, it may be that this interest grows as the schools controversy mounts. There are also very few links in from the mainstream media, which might also be expected to grow after 2010.

A complaint often heard from anti-evolutionists is that the scientific “establishment” does not engage with the critique of evolution which is being offered. That claim would seem to be confirmed here, as both the proportion and absolute number of inbound links from academic domains are also very small.

In sum, this data would suggest that between 1996 and 2010, British creationism was talking largely to itself, and was mostly ignored by academia, the media and most of the churches.

You can download the data, which is in the public domain, from here. Be sure to have plenty of hard disk space as, when unzipped, the data is more than 120GB. The data looks like this:

2010 | churchtimes.co.uk | archbishopofcanterbury.org | 20

which tells you that in 2010, the Internet Archive captured 20 individual resources (usually, although not always, “pages”) in the Church Times site that linked to the archbishop of Canterbury’s site.
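For anyone wanting to work with the file programmatically, a line of this shape can be split into its four fields quite simply. The following is a minimal Python sketch, assuming the fields are always separated by `|` as in the example above; the names `LinkRecord` and `parse_link_line` are my own and not part of the dataset.

```python
from typing import NamedTuple


class LinkRecord(NamedTuple):
    """One line of the data: year, linking host, linked-to host, count."""
    year: int
    source_host: str
    target_host: str
    count: int


def parse_link_line(line: str) -> LinkRecord:
    """Split a pipe-delimited line into a typed record.

    Raises ValueError if the line does not have exactly four fields.
    """
    parts = [p.strip() for p in line.split("|")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, found {len(parts)}: {line!r}")
    year, source, target, count = parts
    return LinkRecord(int(year), source, target, int(count))


record = parse_link_line(
    "2010 | churchtimes.co.uk | archbishopofcanterbury.org | 20"
)
print(record.source_host, "->", record.target_host, record.count)
```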


The analysis rests on a number of assumptions and caveats:

(i) that a host “abc.co.uk” held the same content as “www.abc.co.uk”.

(ii) that the Internet Archive were no more likely to miss hosts that linked to these sites than ones that did not – ie., if there are gaps in what the Internet Archive found, there is no reason to suppose that they systematically skew this particular analysis.

(iii) that my sample of four target sites was reasonably representative of the movement as a whole. It remains possible that the profile of inbound links is very different for other hosts of the same type.

(iv) the analysis does not include cases where a site moved from one host to another during the time period. The host URLs used are those in current use, and so if another host linked to a previous host and that link was not subsequently updated, then that linkage will not be recorded in this data.

(v) that the inconsistency in deduplication at the British Library noted here does not affect this analysis.
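By way of illustration, the extraction of unique linking hosts, together with assumption (i), might be sketched in Python as follows. This is a reconstruction for the purposes of explanation rather than the exact procedure I followed: it assumes pipe-delimited lines of year, source host, target host and count, treats a leading “www.” as insignificant, and uses placeholder host names in place of the four real target sites.

```python
from typing import Iterable, Set


def normalise_host(host: str) -> str:
    """Treat 'www.abc.co.uk' and 'abc.co.uk' as the same host (assumption (i))."""
    host = host.strip().lower()
    return host[4:] if host.startswith("www.") else host


def linking_hosts(lines: Iterable[str], targets: Set[str],
                  first_year: int = 1996, last_year: int = 2010) -> Set[str]:
    """Collect the unique hosts that link to any target within the year range."""
    wanted = {normalise_host(t) for t in targets}
    found: Set[str] = set()
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue  # skip lines that do not fit the expected shape
        year, source, target, _count = parts
        if not year.isdigit() or not first_year <= int(year) <= last_year:
            continue
        if normalise_host(target) in wanted:
            found.add(normalise_host(source))
    return found


# Placeholder hosts only; these are not the sites examined in the post.
sample = [
    "2005 | www.church-a.example | www.target-one.example | 3",
    "2007 | church-a.example | target-one.example | 1",
    "2009 | blog-b.example | target-two.example | 2",
    "2011 | late-c.example | target-one.example | 5",
    "2006 | blog-b.example | unrelated.example | 9",
]
targets = {"target-one.example", "target-two.example"}

result = linking_hosts(sample, targets)
print(sorted(result))
```

Since the function accepts any iterable of lines, an open file handle can be passed in directly, so that the full dataset is read one line at a time. The categorisation of each host that follows remains a manual task.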