When using an archive could put it in danger

Towards the end of 2013 the UK saw a public controversy seemingly made to showcase the value of web archives. The Conservative Party, in what I still think was nothing more than a housekeeping exercise, moved an archive of older political speeches to a harder-to-find part of their site, and applied the robots.txt protocol to the content. As I wrote for the UK Web Archive blog at the time:

“Firstly, the copies held by the Internet Archive (archive.org) were not erased or deleted – all that happened is that access to the resources was blocked. Due to the legal environment in which the Internet Archive operates, they have adopted a policy that allows web sites to use robots.txt to directly control whether the archived copies can be made available. The robots.txt protocol has no legal force but the observance of it is part of good manners in interaction online. It requests that search engines and other web crawlers such as those used by web archives do not visit or index the page. The Internet Archive policy extends the same courtesy to playback.

“At some point after the content in question was removed from the original website, the party added the content in question to their robots.txt file. As the practice of the Internet Archive is to observe robots.txt retrospectively, it began to withhold its copies, which had been made before the party implemented robots.txt on the archive of speeches. Since then, the party has reversed that decision, and the Internet Archive copies are live once again.”
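For readers unfamiliar with the mechanics, here is a minimal sketch of how a single robots.txt rule changes what a well-behaved crawler (or a playback service that checks the live file) will do. It uses Python's standard urllib.robotparser module. The domain, the path /archive/speeches/ and the user-agent string are illustrative assumptions of mine, not the party's actual configuration, and the "playback" check is a much simplified model of the policy described above rather than the Internet Archive's real code.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt before the change: nothing is disallowed.
ROBOTS_BEFORE = "User-agent: *\nDisallow:\n"

# Hypothetical robots.txt after the change: the speech archive is excluded.
ROBOTS_AFTER = "User-agent: *\nDisallow: /archive/speeches/\n"

# Hypothetical URL of an archived speech and an illustrative crawler name.
SPEECH_URL = "https://www.example.org/archive/speeches/2009/speech.html"
CRAWLER = "ia_archiver"

# Simplified model of retrospective observance: access to an existing copy
# depends on what the *current* robots.txt says, not on what it said at the
# time of capture.
visible_before = is_allowed(ROBOTS_BEFORE, CRAWLER, SPEECH_URL)
visible_after = is_allowed(ROBOTS_AFTER, CRAWLER, SPEECH_URL)

print("Visible before the change:", visible_before)
print("Visible after the change:", visible_after)
```

The point of the sketch is that nothing about the archived copy itself changes between the two checks: only the live site's robots.txt does.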

Courtesy of wfryer on flickr.com, CC BY-SA 2.0 : https://www.flickr.com/photos/wfryer/


As public engagement lead for the UK Web Archive at the time, I was happily able to use the episode to draw attention to holdings of the same content in UKWA that were not retrospectively affected by a change to the robots.txt of the original site.

This week I’ve been prompted to think about another aspect of this issue by my own research. I’ve had occasion to spend some time looking at archived content from a political organisation in the UK, the values of which I deplore but which as scholars we need to understand. The UK Web Archive holds some data from this particular domain, but only back to 2005, and the earlier content is only available in the Internet Archive.

Some time ago I mused on a possible ‘Heisenberg principle of web archiving’ – the idea that, as public consciousness of web archiving steadily grows, the consciousness of that fact begins to affect the behaviour of the live web. In 2012 it was hard to see how we might observe any such trend, and I don’t think we’re any closer to being able to do so. But the Conservative Party episode highlights the vulnerability of content in the Internet Archive to a change in robots.txt policy by an organisation with something to hide and a new-found understanding of how web archiving works.

Put simply: the content I’ve been citing this week could later today disappear from view if the organisation concerned wanted it to, and were to come to understand how to make it happen. It is possible, in short, effectively to delete the archive – which is rather terrifying.

In the UK, at least, the danger of this is removed for content published after 2013, due to the provisions of Non-Print Legal Deposit. (And this is yet another argument for legal deposit provisions in every jurisdiction worldwide.) In the meantime, as scholars, we are left with the uneasy awareness that the more we draw attention to the archive, the greater the danger to which it is exposed.

New resources at Lambeth Palace Library

As in previous years, a little round-up of newly available resources at Lambeth for historians of the twentieth century, derived as usual from the Annual Review, this time for 2014.

The cataloguing of the main run of Archbishops’ Papers has reached 1984, a year which sees Robert Runcie having to deal with the controversial appointment of David Jenkins as bishop of Durham, and the miners’ strike.

Of particular interest to me are the newly catalogued papers of the Council for Foreign Relations dealing with relations with Roman Catholics in the UK (CFR RC 161-193), from the immediate post-war period until the 1980s. Also from the CFR are the papers relating to Lutheran and Reformed churches overseas for the key period from 1933 until 1981. Both series complement my own work on Michael Ramsey.

For historians of evangelicalism, the cataloguing of the papers of John Stott is also complete, including a substantial amount of printed material.

The manuscripts catalogue may be accessed here.

Welcoming the new Journal of Open Humanities Data

After some months in the making, I am delighted to be able to draw attention to the new Journal of Open Humanities Data. I’m particularly pleased to be a member of the editorial board.

Fully peer-reviewed, JOHD carries “publications describing humanities data or techniques with high potential for reuse.”

The journal accepts two kinds of papers:

“1. Metapapers, that describe humanities research objects with high reuse potential. This might include quantitative and qualitative data, software, algorithms, maps, simulations, ontologies etc. These are short (1000 word) highly structured narratives and must conform to the Metapaper template.

“2. Full length research papers that describe different methods used to create, process, evaluate, or curate humanities research objects. These are intended to be longer narratives (3,000 – 5,000 words) which give authors the ability to describe a research object and its creation in greater detail than a traditional publication.”

For more detail, see the JOHD at Ubiquity Press.

Christianity and Religious Plurality: Studies in Church History 51

A recent arrival on the doormat was the latest volume of Studies in Church History, being papers mostly from the Ecclesiastical History Society’s conference in Chichester in 2013. Given the theme of religious plurality, there are rich pickings for scholars of the twentieth century, which isn’t always the case with Studies.

In no particular order, some of the papers of particular interest are:

  • John Wolffe’s presidential address to the conference on the Christian response to religious minorities in London since 1800;
  • Marion Bowman on plurality and vernacular religion in early twentieth century Glastonbury;
  • Martin Wellings on James Hope Moulton’s 1913 book Religions and Religion;
  • Stuart Mews on a Christian-Hindu encounter in the University of London (1909-17);
  • John Maiden on a fascinating contested church building redundancy in Bedford in 1977-8; and
  • my own paper on Michael Ramsey and his encounter with other faiths (of which there is an extended summary).

As well as these, there are papers on twentieth century Egypt, Indonesia, Lebanon and Jerusalem, as well as on the Chaldean Catholic Church in modern Iraq.

 

Dramatic adaptations of James Joyce’s ‘Dubliners’ in 1960s Belfast

Scholars of James Joyce (one of whom I am assuredly not) may be interested in a chance discovery in the archival collections of the National University of Ireland Galway. Obscured by an incomplete catalogue record is the existence of adaptations for the stage of three of the stories in Joyce’s Dubliners, one of which at least was produced by the Lyric Players in Belfast in March 1963.

File T4/75 in the Lyric Theatre/O’Malley archive is catalogued as concerning a triple-bill production of plays by W.B. Yeats, J.M. Synge, and Lady Augusta Gregory. On examination of the file, the programme states that the production was in fact of four plays rather than three. The fourth was an adaptation of ‘Grace’, one of the stories in Dubliners, made by Maureen O’Farrell and James O’Connor. In the same file there is a script of the same that establishes the point. O’Farrell (later Maureen Charlton) was involved in the Belfast theatrical scene, and adapted Synge’s Playboy of the Western World as a musical. The file also contains a number of photographs of the production of ‘Grace’.

In the same file there is a second script, typed on the same yellow paper, with a missing first page. This appears to be a similar adaptation of ‘Ivy Day in the Committee Room’, also from Dubliners. However, it doesn’t seem to have been produced, although it was presumably considered.

If my identification is correct, it also makes sense of file T4/432 in the same archive, which contains a third adaptation in the same typescript on the same yellow paper of ‘The Dead’, a third Dubliners story. The catalogue records this as of an unknown adaptor, although it seems likely that this was also the work of O’Farrell and O’Connor.

It may be that these adaptations are well known to Joyce scholars; but I record them here in case they are not.

Why hoping private companies will just do the Right Thing doesn’t work

In the last few weeks I’ve been to several conferences on the issue of the preservation of online content for research, and in particular social media. This is an issue that is attracting a lot of attention at the moment: for example, see Helen Hockx-Yu’s paper for last year’s IFLA conference, or the forthcoming TechWatch report from the Digital Preservation Coalition. As I myself blogged a little while ago, and (obliquely) suggested in this presentation on religion and social media, there’s growing interest from social scientists in using social media data – most typically Twitter or Facebook – to understand contemporary social phenomena. But whereas users of the archived web (such as myself) can rely on continued access to the data we use, and can expect to be able to point to that data such that others may follow and replicate our results, this isn’t the case with social media.

Commercial providers of social media platforms impose several different kinds of barriers. These can include: limits on the amount of data that may be requested in any one period of time; provision of samples of data created by proprietary algorithms which may not themselves be scrutinised; limits on how much of and/or which fields in a dataset may be shared with other researchers. These issues are well-known, and aren’t my main concern here. My concern is with how these restrictions are being discussed by scholars, librarians and archivists.

I’ve noticed an inability to imagine why it is that these restrictions are made, and as a result, a struggle to begin to think what the solutions might be. There has been a similar trend amongst the Open Access community to paint commercial academic publishers as profit-hungry dinosaurs, making money without regard to the public good element of scholarly publishing. Regarding social media, it is viewed as simply a failure of good manners when a social media firm shuts down a service without providing for scholarly access to its archive, or does not allow free access to and reuse of its data to scholars. Why (the question is implicitly posed) don’t these organisations do the Right Thing? Surely everyone thinks that preserving this stuff is worthwhile, and that it is a duty of all providers?

But private corporations aren’t individuals, endowed with an idea of duty and a moral sense. Private corporations are legal abstractions: machines designed for the maximisation of return on capital. If they don’t do the Right Thing, it isn’t because the people who run them are bad people. No; it’s because the thing we want them to do (or not do) impacts adversely on revenue, or adds extra cost without corresponding additional revenue.

Fundamentally, a commercial organisation is likely to shut down an unprofitable service without regard to the archive unless (i) providing access to the archive is likely to yield research findings which will help future service development; or (ii) it causes positive harm to the brand to shut it down (or helps the brand to be seen *not* to do so). Similarly, they are unlikely to incur costs to run additional services for researchers, or to share valuable data, unless (again) they stand to gain something from the research, however obliquely, or by doing so they either help or protect the brand.

At this point, readers may despair of getting anywhere in this regard, which I could understand. One way through this might be an enlargement of the scope of legal deposit legislation such that some categories of data (politicians’ tweets, say, given the recent episode over Politwoops) are deemed sufficiently significant to be treated as public records. There will be lobbying against, surely, but once such law is passed, companies will adapt business models to a changed circumstance, as they always have done. An even harder task is so to shift the terms of public discourse such that a publicly accessible record of this data is seen by the public as necessary. Another way is to build communities of researchers around particular services such that generalisable research about a service can be absorbed by the providers, thus showing that openness with the data leads to a gain in terms of research and development.

All of these are in their ways Herculean tasks, and I have no blueprint for them. But recognising the commercial realities of the situation would get us further than vague pieties about persuading private firms to do the Right Thing. It isn’t how they work.

Lecture at NUI Galway: ‘Prospects and pitfalls in web archives for research’

Some details of my public lecture at the National University of Ireland Galway in a couple of weeks:

Date:  Tuesday June 23rd, 3pm
Venue:  Moore Institute Seminar Room, G010, Hardiman Research Building (map)
Title:   ‘A new class of scholarly resource? Prospects and pitfalls in using web archives for research’
Abstract:  Viewed globally, the process of archiving the web and providing access to that archive is some way ahead of scholars’ and archivists’ understanding of the uses scholars will make of this new class of resource. This lecture will make the case that scholars of contemporary life have a stake in the successful archiving of the web, and in helping determine its shape. After then examining the current state of web archiving in the UK and Ireland, it will present the results of recent and ongoing research into religious discourse in the British and Irish web domains, presenting both substantive conclusions and proposing methods and approaches that are more widely applicable to other scholarly issues.
[Update: the slides are now available on Slideshare]