I am currently working on a chapter contribution to the forthcoming Sage Handbook of Web History, edited by Megan Sapnar Ankerson, Niels Brügger and Ian Milligan. Although the inclusion of the chapter is subject to peer review, here’s my abstract. It should appear some time in late 2017.
“This chapter seeks both to assess the state of current scholarship on online religion, and to suggest potential directions for future research. There are now 20 years of research in the field of Internet Studies in relation to religious organisations, faith and practice. However, it is less clear that this body of work yet represents a specifically historical inquiry into religion on the Web, although in many cases it will provide the foundation of such work. Much of the research to date has concentrated on the nature of emerging communities of individuals: communities that were either an alternative or a supplement to face-to-face relations in particular localities. This chapter draws out trends emerging in this scholarship over the 25 years of Web history, as the affordances of the Web have developed. Attention is also paid to the balance of institutional authority and individual self-expression in a religious space that is unregulated, or at least that must be regulated in new ways. The chapter asks how far this scholarship may be integrated into wider histories of offline religious authority and practice, which have themselves undergone shifts and transformations of perhaps equal significance.
“Rather less prominent in the literature so far is the institutional history of religion. Making use of the archived Web in particular, the chapter sketches the outline of a new area of inquiry: the evolution of the religious web sphere, as a global whole, within each of the world religions and denominations, and at a national level. To what degree has the nature of the Web, a decentralised international network which contrasts with the hierarchical nature of most religious organisations, moulded the religious web sphere into a different shape? Early studies in this area have suggested that, in certain key ways, the religious web sphere can be read as a reimplementation of older structures of influence, attention and esteem that were visible before, and remain visible offline. Insofar as the religious web does not mirror the traditional offline structure of religious organisations, the chapter also reflects on how far this changed shape may be accounted for by broader trends in religious history, in a period of rapid change. How far does it relate to the recent history of religion in the media more generally?
“At a more abstract level, the chapter will attend to the degree to which the myths of the Web, and indeed of the whole Internet – of a pluralistic, idealistic, liberating force with an agency of its own – have shaped understandings of the Web’s religious history. It examines how far the last quarter century has really been a period of rupture and discontinuity, and how much has in fact stayed the same, or continued on a path on which it was set before the Web appeared. It will also assess how far the field has to date focussed excessively on the new, to the neglect of the histories of how practices and technologies that were once new become mainstream.”
Details of a lecture I shall give next week:
Title: Doing (very) contemporary history with the archived Web: Rowan Williams, archbishop of Canterbury, and the sharia law controversy of 2008
Date: Thursday, 9th June, 1pm
Venue: Weston Library Lecture Theatre, University of Oxford
Booking details: booking is advisable but not essential. It’s free.
Abstract: The decade following the turn of the millennium may have seen an epochal shift in the nature of the discussion of religion in public life in the UK. The 9/11 attacks in the USA and the terrorist bombings in London in 2005 prompted an outpouring of anxiety concerning the place of Islam in British society. The period also saw the coming to prominence of the ‘New Atheism’ associated with figures such as Richard Dawkins and Christopher Hitchens. The uniquely privileged position of Christianity, and of the Church of England in particular, was also under greater scrutiny than had been the case for decades.
This paper examines a crucial episode of public controversy closely connected to each of these trends: a lecture given in 2008 by Rowan Williams, archbishop of Canterbury, on the accommodation of Islamic sharia law within British law. Using archived web content from the UK Web Archive, held by the British Library, it examines the controversy as it played out on the UK web. It argues that the episode prompted both a step-change in the level of attention paid to the archbishop’s web domain and a broadening of the types of organisation which took notice of him. At the same time, it also suggests that the historic media habit of privileging the public statements of the archbishop over those of any other British faith leader was extended onto the web.
The paper uses techniques of both close and distant reading: on the one hand, aggregate link analysis of the whole .uk web domain, and on the other, micro-analysis of individual domains and pages. In doing so, it demonstrates some of the ways in which contemporary historians will very soon need to use the archived web to address older questions in a new way, in a new context of super-abundant data.
A theme that emerged for me in the IIPC web archiving conference in Reykjavik last week was metadata, and specifically: precisely which metadata do users of web archives need in order to understand the material they are using?
At one level, a precise answer to this will only come from sustained and detailed engagement with users themselves; research which I would very much hope that the IIPC would see as part of its role to stimulate, organise and indeed fund. But that takes time, and at present, most users understand the nature of the web archiving process only rather vaguely. As a result, I suspect that without the right kind of engagement, scholars are likely (as Matthew Weber noted) to default to ‘we need everything’, or if asked directly ‘what metadata do you need?’ may well answer ‘well, what do you have, and what would it tell me?’
During my own paper I referred to the issue, and was asked by a member of the audience if I could say what such enhanced metadata provision might look like. What I offer here is the first draft of an answer: a five-part scheme of kinds of metadata and documentation that may be needed (or at least, that I myself would need). I could hardly imagine this would meet every user requirement; but it’s a start.
1. Collecting organisation level
At the very broadest level, users need to know something of the history of the collecting organisation, and how web archiving has become part of its mission and purpose. I hope to provide an overview of aspects of this on a world scale in this forthcoming article on the recent history of web archiving.
2. Domain or broad crawl
Periodic archiving of a whole national domain under legal deposit provisions now offers the prospect of the kind of aggregate analysis that takes us well beyond single-resource views in the Wayback Machine. But it becomes absolutely vital to know certain things at crawl level. How was territoriality determined: by ccTLD, domain registration, Geo-IP lookup, or curatorial decision? The way the national web sphere is defined fundamentally shapes the way in which we can analyse it. How big was the crawl in relation to previous years? How many domains are new, and how many have disappeared? What was the default policy on robots.txt? How deep was the default crawl scope? Was there a data cap per host? Some of this will already be articulated in internal documents; some will need additional data analysis; but it all goes to the heart of how we might read the national web sphere as a whole.
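As a minimal sketch of just one of these scope questions, territoriality by ccTLD alone can be checked mechanically (the hostnames and the `.uk` scope here are invented for illustration; Geo-IP or registration checks would need external data sources):

```python
# Illustrative sketch: classify candidate URLs as in or out of a
# national crawl scope defined purely by ccTLD suffix.
from urllib.parse import urlsplit

def in_cctld_scope(url: str, cctld: str = ".uk") -> bool:
    """Return True if the URL's host falls under the given ccTLD."""
    host = urlsplit(url).hostname or ""
    return host == cctld.lstrip(".") or host.endswith(cctld)

# Hypothetical seed list for a domain crawl.
seeds = [
    "http://www.example.co.uk/",
    "http://www.example.org/about",
    "https://blog.example.uk/",
]

in_scope = [u for u in seeds if in_cctld_scope(u)]
```

A real crawl definition would combine several such tests, which is exactly why the documentation of how they were combined matters to the end user.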
3. Curated collection level
Many web archives have extensive curated collections on particular themes or events. These are a great means of showcasing the value of web archives to the public and to those who hold the purse strings. But if not transparently documented they present some difficulties to the user trying to interpret them, since the selection process adds a layer of human judgement to the more technical decisions that I outlined above. In order to evaluate the collection as a whole, scholars really do need to know the selection criteria, and at a more detailed level than is often provided right now. In particular, in cases where permissions were requested for sites but not received, being able to access the whole list of sites selected, rather than just those that were successfully archived, would help a great deal in understanding the way in which a collection was made.
4. Host/domain level
This is the level at which a great deal of effort is expended to create metadata that looks very much like a traditional catalogue record: subject keywords, free-text descriptions and the like. For me, it would be important to know when the first attempt to crawl a host was, and the most recent, and whether there were 404 responses received for crawl attempts at any time in between. Was this host capped (or uncapped) at the discretion of a curator, in a way that differed from the policy for the crawl as a whole? Similarly, was the crawl scoping different, or the policy on robots.txt? If the crawl incorporates a GeoIP check, what was the result? Which other domains has it redirected to, which redirect to it, and at which times?
5. Individual resource level
Finally, there are some useful things to know about individual resources. As at the host level, information about the date of the first and last attempts to crawl, and about intervening 404s, would tell the user useful things about what we might call the career of a resource. If the resource changes, what is the profile of that: for instance, how has the file size changed over time? Were there other captures which were rejected, perhaps on a QA basis, and if so, when?
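Much of this 'career' information is in principle derivable from CDX index data. The sketch below assumes a simplified seven-field, space-separated CDX layout (urlkey, timestamp, URL, MIME type, status code, digest, length); real CDX files vary in field order and count, and the sample lines are invented:

```python
# Illustrative sketch: derive a resource's 'career' (first and last
# capture, intervening 404s, size over time) from CDX-style index lines.
from datetime import datetime

SAMPLE_CDX = """\
uk,example)/page 20060315120000 http://example.uk/page text/html 200 AAAA 5120
uk,example)/page 20080101093000 http://example.uk/page text/html 404 BBBB 512
uk,example)/page 20100620140000 http://example.uk/page text/html 200 CCCC 7340
"""

def resource_career(cdx_text: str) -> dict:
    captures = []
    for line in cdx_text.strip().splitlines():
        urlkey, ts, url, mime, status, digest, length = line.split()
        captures.append({
            "when": datetime.strptime(ts, "%Y%m%d%H%M%S"),
            "status": status,
            "bytes": int(length),
        })
    captures.sort(key=lambda c: c["when"])
    return {
        "first": captures[0]["when"],
        "last": captures[-1]["when"],
        "had_404": any(c["status"] == "404" for c in captures),
        # Size profile of the successful captures only.
        "sizes": [c["bytes"] for c in captures if c["status"] == "200"],
    }

career = resource_career(SAMPLE_CDX)
```

The point is not the code itself but that such summaries could plausibly be computed from indexes that archives already hold, rather than requiring new collection.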
Much if not quite all of this could be based on data that is already widely collected (in policy documents, curator tools, crawl logs or CDX indexes), or that could be with some adjustment. Delivering these data to users presents some very significant GUI design challenges; some might be better delivered as datasets for download, or via an API. What I hope to have provided, though, is a first sketch of an agenda for what the next generation of access services might disclose: one that is not a default to ‘everything’, and that is feasible given the tools in use.
This week I’m writing the first draft of a chapter on the cultural history of web archiving, for a forthcoming volume of essays (details here). It is subject to peer review and so isn’t yet certain to be published, but here’s the abstract.
I should welcome comments very much, and there may also be a short opportunity for open online peer review.
Users, technologies, organisations: towards a cultural history of world web archiving
‘As systematic archiving of the World Wide Web approaches its twentieth anniversary, the time is ripe for an initial historical assessment of the patterns into which web archiving has fallen. The scene is highly asymmetric, involving a single global organisation, the Internet Archive, alongside a growing number of national memory institutions, many of which are affiliated to the International Internet Preservation Consortium. Many other organisations also engage in archiving the web, including universities and other institutions in the galleries, libraries, archives and museums sector. Alongside these is a proliferation of private-sector providers of web archiving services, and a small but highly diverse group of individuals acting on their own behalf. The evolution of this ecosystem, and the consequences of that evolution, merit investigation.
‘Employing evidence derived from interviews and from published sources, the paper sets out to document at length for the first time the development of the sector in its institutional and cultural aspects. In particular it considers how the relationship between archiving organisations and their stakeholders has played out in different circumstances. How have the needs of the archives themselves and their internal stakeholders and external funders interacted with the needs of the scholarly end users of the archived web? Has web archiving been driven by the evolution of the technologies used to carry it out, the internal imperatives of the organisations involved, or by the needs of the end user?’
I think there would be general agreement amongst web archivists that the country code top-level domain alone is not the whole of a national web. Implementations of legal deposit for the web tend to rely at least in part on the ccTLD (.uk, or .fr) as the means of defining their scope, even if supplemented by other means of selection.
However, efforts to understand the scale and patterns of national web content that lies outside national ccTLDs are in their infancy. An indication of the scale of the question is given by a recent investigation by the British Library. The @UKWebArchive team found more than 2.5 million hosts that were physically located in the UK without having .uk domain names. This would suggest that as much as a third of the UK web may lie outside its ccTLD.
And this is important to scholars, because we often tend to study questions in national terms – and it is difficult to generalise about a national web if the web archive we have is mostly made up of the ccTLD. And it is even more difficult if we don’t really understand how much national content there is outside that circle, and also which kinds of content are more or less likely to be outside the circle. Day to day, we can see that in the UK there are political parties, banks, train companies and all kinds of other organisations that ‘live’ outside .uk – but we understand almost nothing about how typical that is within any particular sector. We also understand very little about what motivates individuals and organisations to register their site in a particular national space.
So as a community of scholars we need case studies of particular sectors to understand their ‘residence patterns’, as it were: are British engineering firms (say) more or less likely to have a web domain from the ccTLD than nurseries, or taxi firms, or supermarkets? And so here is a modest attempt at just such a case study.
All the mainstream Christian churches in the island of Ireland date their origins to many years before the current political division of the island in 1921. As such, all the churches are organised on an all-Ireland basis, with organisational units that do not recognise the political border. In the case of the Church of Ireland (Anglican), although Northern Ireland lies entirely within the province of Armagh (the other province being Dublin), several of the dioceses of the province span the border, such that the bishop must cross the political border on a daily basis to minister to his various parishes.
How is this reflected on the web? In particular, where congregations in the same church are situated on either side of the border, where do their websites live: in .uk, in .ie, or indeed in neither?
I have been assembling lists of individual congregation websites as part of a larger modelling of the Irish religious webspace, and one of these lists covers the Presbyterian Church in Ireland. My initial list contains just over two hundred individual church sites, the vast majority of which are in Northern Ireland (as is the bulk of the membership of the church). Looking at Northern Ireland, the ‘residence pattern’ is:
.co.uk – 23%
.org.uk – 20%
.com – 17%
.org – 37%
Other – 3%
In sum, fewer than half of these sites – of church congregations within the United Kingdom – are ‘resident’ within the UK ccTLD. A good deal of research would be needed to understand the choices made by individual webmasters. However, it is noteworthy that, for Protestant churches in a part of the world where religious and national identities are so closely intertwined, having a UK domain seems not to be all that important.
1. My initial list (derived from one published by the PCI itself) represents only sites which the central organisation of the denomination knew existed at the time of compilation, and there are more than twice as many congregations as there are sites listed. However, it seems unlikely that that in itself can have skewed the proportions.
2. For the very small number of PCI congregations in the Republic of Ireland (that appear in the list), the situation is similar, with fewer than 30% of churches opting for a domain name within the .ie ccTLD. However, the number is too small (26 in all) to draw any conclusions from it.
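The tallying behind such a residence pattern is simple enough to sketch. The hostnames below are invented; a real study would run this over the curated list described above, and a production version would want a proper public-suffix library rather than a fixed tuple:

```python
# Illustrative sketch: compute a 'residence pattern' (percentage of
# hosts per domain suffix) over a list of congregation hostnames.
from collections import Counter

SUFFIXES = (".co.uk", ".org.uk", ".com", ".org")

def residence_pattern(hosts):
    tally = Counter()
    for host in hosts:
        for suffix in SUFFIXES:
            if host.endswith(suffix):
                tally[suffix] += 1
                break
        else:
            tally["other"] += 1
    total = sum(tally.values())
    return {k: round(100 * v / total) for k, v in tally.items()}

# Hypothetical hostnames, one per category.
hosts = [
    "firstchurch.co.uk", "stmarks.org.uk", "gracechurch.com",
    "hopechapel.org", "parish.ie",
]
pattern = residence_pattern(hosts)
```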
While working on another project, I’ve had occasion to compile some data relating to the blog aggregator site britishblogs.co.uk (now apparently defunct), which appears in the Internet Archive between 2006 and 2012. I am unlikely to exploit it much myself, and so I have made it available on figshare, in case it should be of use to anyone else.
Specifically, it is data derived from the UK Host Link Graph, which records links from one host to another in the JISC UK Web Domain Dataset (1996-2010), a dataset of archived web content for the UK country code top-level domain captured by the Internet Archive.
It has 19,423 individual lines, each expressing one host-to-host linkage in content from a single year.
Since the blog as a format seems to be particularly prone to disappearance over time, scholars of the British blogosphere may find this useful in locating now defunct blogs in the Internet Archive or elsewhere. My sense is that the blogs included in the aggregator were nominated by their authors as being British, and so this may be of some help in identifying British content in platforms such as WordPress or Blogger.
Some words of caution. The data is offered very much as-is, without any guarantees about robustness or significance. In particular:
(i) I have made no effort to de-duplicate where the Internet Archive crawled the site, or parts of it, more than once in a single year.
(ii) also present are a certain number of inbound links – that is to say, other hosts linking to britishblogs.co.uk. However, these are very much the minority.
(iii) there is also some analysis needed in understanding which links are to blogs, and which are to content linked to from within those blogs (and aggregated by British Blogs).
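To give a sense of how such data might be used, here is a sketch of filtering for outbound links from the aggregator. The pipe-separated layout `year|source|destination|count` and the sample rows are assumptions for illustration; the dataset's own documentation should be checked for the exact format:

```python
# Illustrative sketch: pull outbound host-to-host links for one source
# host from link-graph data in an assumed 'year|src|dst|count' layout.
SAMPLE = """\
2007|britishblogs.co.uk|someblog.wordpress.com|3
2007|anotherhost.org|britishblogs.co.uk|1
2009|britishblogs.co.uk|example.blogspot.com|2
"""

def outbound_links(text, source="britishblogs.co.uk"):
    rows = []
    for line in text.strip().splitlines():
        year, src, dst, count = (field.strip() for field in line.split("|"))
        if src == source:  # drop the inbound-link rows noted above
            rows.append((int(year), dst, int(count)))
    return rows

links = outbound_links(SAMPLE)
```

Filtering on destination hosts under wordpress.com or blogspot.com would be one quick way of locating the British blogs on those platforms mentioned above.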
Towards the end of 2013 the UK saw a public controversy seemingly tailor-made to showcase the value of web archives. The Conservative Party, in what I still think was no more than a housekeeping exercise, moved an archive of older political speeches to a harder-to-find part of its site, and applied the robots.txt protocol to the content. As I wrote for the UK Web Archive blog at the time:
“Firstly, the copies held by the Internet Archive (archive.org) were not erased or deleted – all that happened is that access to the resources was blocked. Due to the legal environment in which the Internet Archive operates, they have adopted a policy that allows web sites to use robots.txt to directly control whether the archived copies can be made available. The robots.txt protocol has no legal force but the observance of it is part of good manners in interaction online. It requests that search engines and other web crawlers such as those used by web archives do not visit or index the page. The Internet Archive policy extends the same courtesy to playback.
“At some point after the content in question was removed from the original website, the party added the content in question to their robots.txt file. As the practice of the Internet Archive is to observe robots.txt retrospectively, it began to withhold its copies, which had been made before the party implemented robots.txt on the archive of speeches. Since then, the party has reversed that decision, and the Internet Archive copies are live once again.”
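The mechanics of the rule being applied can be sketched with Python's standard robots.txt parser. The rules and URLs here are invented for illustration; the point is that one `Disallow` line is all it takes, whether evaluated at crawl time or, as in the Internet Archive's policy, retrospectively at playback:

```python
# Illustrative sketch: how a crawler (or an archive applying robots.txt
# retrospectively) evaluates whether a path may be fetched or replayed.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /archive/speeches/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A page under the disallowed path is withheld; other paths are not.
blocked = not rp.can_fetch("*", "http://example.org/archive/speeches/2006.html")
allowed = rp.can_fetch("*", "http://example.org/news/")
```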
As public engagement lead for the UK Web Archive at the time, I was happily able to use the episode to draw attention to holdings of the same content in UKWA that were not retrospectively affected by a change to the robots.txt of the original site.
This week I’ve been prompted to think about another aspect of this issue by my own research. I’ve had occasion to spend some time looking at archived content from a political organisation in the UK, the values of which I deplore but which as scholars we need to understand. The UK Web Archive holds some data from this particular domain, but only back to 2005, and the earlier content is only available in the Internet Archive.
Some time ago I mused on a possible ‘Heisenberg principle of web archiving’ – the idea that, as public consciousness of web archiving steadily grows, that consciousness begins to affect the behaviour of the live web. In 2012 it was hard to see how we might observe any such trend, and I don’t think we’re any closer to being able to do so. But the Conservative Party episode highlights the vulnerability of content in the Internet Archive to a change in robots.txt policy by an organisation with something to hide and a new-found understanding of how web archiving works.
Put simply: the content I’ve been citing this week could disappear from view later today, if the organisation concerned wanted it to and came to understand how to make that happen. It is possible, in short, effectively to delete the archive, which is rather terrifying.
In the UK, at least, the danger of this is removed for content published after 2013, due to the provisions of Non-Print Legal Deposit. (And this is yet another argument for legal deposit provisions in every jurisdiction worldwide). In the meantime, as scholars, we are left with the uneasy awareness that the more we draw attention to the archive, the greater the danger to which it is exposed.