When using an archive could put it in danger

Towards the end of 2013 the UK saw a public controversy seemingly made to showcase the value of web archives. The Conservative Party, in what I still think was nothing more than a housekeeping exercise, moved an archive of older political speeches to a harder-to-find part of their site, and applied the robots.txt protocol to the content. As I wrote for the UK Web Archive blog at the time:

“Firstly, the copies held by the Internet Archive (archive.org) were not erased or deleted – all that happened is that access to the resources was blocked. Due to the legal environment in which the Internet Archive operates, they have adopted a policy that allows web sites to use robots.txt to directly control whether the archived copies can be made available. The robots.txt protocol has no legal force but the observance of it is part of good manners in interaction online. It requests that search engines and other web crawlers such as those used by web archives do not visit or index the page. The Internet Archive policy extends the same courtesy to playback.

“At some point after the content in question was removed from the original website, the party added the content in question to their robots.txt file. As the practice of the Internet Archive is to observe robots.txt retrospectively, it began to withhold its copies, which had been made before the party implemented robots.txt on the archive of speeches. Since then, the party has reversed that decision, and the Internet Archive copies are live once again.”
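
To make the mechanism concrete, here is a minimal sketch in Python of what retrospective observance of robots.txt amounts to: playback is decided by the site's robots.txt as it stands today, not as it stood when the copy was made. This is an illustration under stated assumptions, not the Internet Archive's actual code. The function name playback_allowed, the example.org URL, the /archive/speeches/ path and the "ia_archiver" user agent string are all hypothetical choices made for the example.

```python
from urllib.robotparser import RobotFileParser


def playback_allowed(current_robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the site's CURRENT robots.txt permits access to url.

    A retrospective policy consults today's rules, so a rule added after
    capture can still hide a copy that was made earlier.
    """
    parser = RobotFileParser()
    parser.parse(current_robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt before and after the site owner adds a rule.
ROBOTS_BEFORE = "User-agent: *\nDisallow:\n"
ROBOTS_AFTER = "User-agent: *\nDisallow: /archive/speeches/\n"

# Hypothetical URL of a page that was archived years earlier.
SPEECH_URL = "http://www.example.org/archive/speeches/2010-01-01.html"

if __name__ == "__main__":
    print(playback_allowed(ROBOTS_BEFORE, "ia_archiver", SPEECH_URL))
    print(playback_allowed(ROBOTS_AFTER, "ia_archiver", SPEECH_URL))
```

Nothing in the archive is deleted between the two calls; only the answer to the access check changes, and it changes back if the rule is later removed.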

Courtesy of wfryer on flickr.com, CC BY-SA 2.0 : https://www.flickr.com/photos/wfryer/


As public engagement lead for the UK Web Archive at the time, I was happily able to use the episode to draw attention to holdings of the same content in UKWA that were not retrospectively affected by a change to the robots.txt of the original site.

This week I’ve been prompted to think about another aspect of this issue by my own research. I’ve had occasion to spend some time looking at archived content from a political organisation in the UK, the values of which I deplore but which as scholars we need to understand. The UK Web Archive holds some data from this particular domain, but only back to 2005, and the earlier content is only available in the Internet Archive.

Some time ago I mused on a possible ‘Heisenberg principle of web archiving’ – the idea that, as public consciousness of web archiving steadily grows, the consciousness of that fact begins to affect the behaviour of the live web. In 2012 it was hard to see how we might observe any such trend, and I don’t think we’re any closer to being able to do so. But the Conservative Party episode highlights the vulnerability of content in the Internet Archive to a change in robots.txt policy by an organisation with something to hide and a new-found understanding of how web archiving works.

Put simply: the content I’ve been citing this week could later today disappear from view if the organisation concerned wanted it to, and were to come to understand how to make it happen. It is possible, in short, effectively to delete the archive – which is rather terrifying.

In the UK, at least, the danger of this is removed for content published after 2013, due to the provisions of Non-Print Legal Deposit. (And this is yet another argument for legal deposit provisions in every jurisdiction worldwide.) In the meantime, as scholars, we are left with the uneasy awareness that the more we draw attention to the archive, the greater the danger to which it is exposed.

Why hoping private companies will just do the Right Thing doesn’t work

In the last few weeks I’ve been to several conferences on the issue of the preservation of online content for research, and in particular social media. This is an issue that is attracting a lot of attention at the moment: for examples, see Helen Hockx-Yu’s paper for last year’s IFLA conference, or the forthcoming TechWatch report from the Digital Preservation Coalition. As I myself blogged a little while ago, and (obliquely) suggested in this presentation on religion and social media, there’s growing interest from social scientists in using social media data – most typically Twitter or Facebook – to understand contemporary social phenomena. But whereas users of the archived web (such as myself) can rely on continued access to the data we use, and can expect to be able to point to that data such that others may follow and replicate our results, this isn’t the case with social media.

Commercial providers of social media platforms impose several different kinds of barriers. These can include: limits on the amount of data that may be requested in any one period of time; provision of samples of data created by proprietary algorithms which may not themselves be scrutinised; limits on how much of and/or which fields in a dataset may be shared with other researchers. These issues are well-known, and aren’t my main concern here. My concern is with how these restrictions are being discussed by scholars, librarians and archivists.

I’ve noticed an inability to imagine why it is that these restrictions are made, and as a result, a struggle to begin to think what the solutions might be. There has been a similar trend amongst the Open Access community, to paint commercial academic publishers as profit-hungry dinosaurs, making money without regard to the public good element of scholarly publishing. Regarding social media, it is viewed as simply a failure of good manners when a social media firm shuts down a service without providing for scholarly access to its archive, or does not allow free access to and reuse of its data to scholars. Why (the question is implicitly posed) don’t these organisations do the Right Thing? Surely everyone thinks that preserving this stuff is worthwhile, and that it is a duty of all providers?

But private corporations aren’t individuals, endowed with an idea of duty and a moral sense. Private corporations are legal abstractions: machines designed for the maximisation of return on capital. If they don’t do the Right Thing, it isn’t because the people who run them are bad people. No; it’s because the thing we want them to do (or not do) impacts adversely on revenue, or adds extra cost without corresponding additional revenue.

Fundamentally, a commercial organisation is likely to shut down an unprofitable service without regard to the archive unless (i) providing access to the archive is likely to yield research findings which will help future service development; or (ii) it causes positive harm to the brand to shut it down (or helps the brand to be seen *not* to do so). Similarly, they are unlikely to incur costs to run additional services for researchers, or to share valuable data unless (again) they stand to gain something from the research, however obliquely, or by doing so they either help or protect the brand.

At this point, readers may despair of getting anywhere in this regard, which I could understand. One way through this might be an enlargement of the scope of legal deposit legislation such that some categories of data (politicians’ tweets, say, given the recent episode over Politwoops) are deemed sufficiently significant to be treated as public records. There will be lobbying against, surely, but once such law is passed, companies will adapt business models to a changed circumstance, as they always have done. An even harder task is so to shift the terms of public discourse such that a publicly accessible record of this data is seen by the public as necessary. Another way is to build communities of researchers around particular services such that generalisable research about a service can be absorbed by the providers, thus showing that openness with the data leads to a gain in terms of research and development.

All of these are in their ways Herculean tasks, and I have no blueprint for them. But recognising the commercial realities of the situation would get us further than vague pieties about persuading private firms to do the Right Thing. It isn’t how they work.

Lecture at NUI Galway: ‘Prospects and pitfalls in web archives for research’

Some details of my public lecture at the National University of Ireland Galway in a couple of weeks:

Date:  Tuesday June 23rd, 3pm
Venue:  Moore Institute Seminar Room, G010, Hardiman Research Building (map)
Title:   ‘A new class of scholarly resource? Prospects and pitfalls in using web archives for research’
Abstract:  Viewed globally, the process of archiving the web and providing access to that archive is some way ahead of scholars’ and archivists’ understanding of the uses scholars will make of this new class of resource. This lecture will make the case that scholars of contemporary life have a stake in the successful archiving of the web, and in helping determine its shape. After then examining the current state of web archiving in the UK and Ireland, it will present the results of recent and ongoing research into religious discourse in the British and Irish web domains, presenting both substantive conclusions and proposing methods and approaches that are more widely applicable to other scholarly issues.
[Update: the slides are now available on Slideshare]

Web Archiving 101: a new course and podcast

It was a great pleasure to work with former colleagues at the University of London Computing Centre this week to deliver Web Archiving 101, a day course which forms part of ULCC’s highly successful Digital Preservation Training Programme. My thanks to Steph Taylor and Ed Pinsent, my fellow trainers, and also Sara Day Thomson from the Digital Preservation Coalition who taught the module on social media archiving.

To get a taste of the day, see the programme. A few days before, Ed, Steph and I also had a very enjoyable talk about the issues the course would raise: it’s available as a podcast in Soundcloud.

Understanding the web of faith: forthcoming book chapter

I’m very pleased to say that an essay of mine has been accepted for a forthcoming volume: The Web as History: the first two decades. It is edited by Niels Brügger and Ralph Schroeder, and will appear Open Access with UCL Press in 2016.

Here’s my abstract:

‘Much of the discourse that historians of contemporary religion until recently tracked in correspondence, periodical publication and print ephemera has migrated online. But the task of understanding religious discourse in the UK web space has hardly begun. The task is hard to undertake at the highest level since there are no second-level domains that serve as useful units of analysis — there is no faith.uk to match nhs.uk or ac.uk.

‘This chapter represents a first step towards understanding the evolution of the UK religious web space, by means of two interrelated case studies, which between them point to the agenda and content of a larger research project. Both studies utilise the JISC UK Web Domain Dataset for the period 1996-2008, as held by the British Library.

‘Firstly, it will examine the web archive footprint left by the public controversy in 2008 over the comments made by Rowan Williams, archbishop of Canterbury, on the matter of sharia law. Using both the link graph and a direct qualitative analysis of archived content, it will explore both the shape and the content of the controversy and show the degree to which religious debate had not only migrated from print to the web, but in doing so had engaged different actors and lost others, and changed in its tone.

‘Secondly, it will consider the growing tension in religious discourse between faith groups and organisations with a secularist agenda. Again, using the link graph and some qualitative analysis, it will explore the patterns in which linkages grew and shifted between the web estates of key but opposed organisations in relation to issues including faith schools and creationism, the reform of the law on blasphemy, and the place of the bishops in the House of Lords.’

Will historians of the future be able to study Twitter?

Over the last year or so, the IHR Digital History seminar has become increasingly web-focussed, which is of course of interest to me (if not necessarily to everyone.) Last week we had an excellent paper from Jack Grieve of Aston University on the tracking of newly emerging words as they appeared in large corpora of tweets from the UK and the US. By amassing very large tweet datasets, he and his colleagues are able to observe the early traces of newly emerging words, and also (when those tweets were submitted from devices which attach geo-references) to see where those new words first appear, and how they spread. Jack and his colleagues are finding that words quite often emerge first (in the US) in the east and south-east (or California) and then spread towards the centre of the continent. They don’t necessarily spread in even waves across space, or even spring between urban centres and then to rural areas (as would have been my uneducated guess). Read more at the project site, treets.net, or watch the paper.

This kind of approach is quite impossible without the kind of very large-scale natural language data such as social media afford. This is particularly so as most words are (perhaps counter-intuitively) rather rare. In the corpus in question, the majority of the 67,000 most common words appear only once in 25 million words. Given this, datasets of billions of tweets are the minimum size necessary to be able to see the patterns.
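
The arithmetic behind that claim can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a calculation from Jack's paper: it treats the figure above as a rate of one occurrence per 25 million words, and the average of 10 words per tweet is a hypothetical round number chosen only for the example.

```python
# Rate taken from the figure quoted above: one occurrence per 25 million words.
WORDS_PER_OCCURRENCE = 25_000_000

# Hypothetical average tweet length, assumed purely for illustration.
ASSUMED_WORDS_PER_TWEET = 10


def expected_occurrences(tweet_count: int) -> float:
    """Expected sightings of one rare word in a corpus of tweet_count tweets."""
    corpus_words = tweet_count * ASSUMED_WORDS_PER_TWEET
    return corpus_words / WORDS_PER_OCCURRENCE


if __name__ == "__main__":
    for tweets in (1_000_000, 1_000_000_000):
        print(tweets, expected_occurrences(tweets))
```

On these assumptions a million tweets would be expected to contain such a word less than once (0.4 times), while a billion tweets would yield around 400 occurrences, which is the scale at which a pattern in time and space starts to become visible.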

It was interesting to me as a convenor to see the rather different spread of people who came to this paper, as opposed to the more usual digital history work the seminar showcases. Jack focussed on tweets posted since 2013, a time span that even the most contemporary historian would struggle to call their own, and so not so many historians came along (though we had perhaps our first mathematician instead). Their absence was a shame, as Jack’s paper was a fascinating glimpse into the way that historical linguistics, and indeed other types of historical enquiry, might look in a couple of decades’ time.

But there is a caveat to this, which was beyond the scope of Jack’s paper, to do with the means by which this data will be accessible to scholars of 2014 working in (say) 2044. Jack and his colleagues work directly from the so-called Twitter “firehose”; they harvest every tweet coming from the Twitter API, and (on their own hardware) process each tweet and discard those that are not geo-coded to within the study area. This kind of work involves considerable local computing firepower, and (more importantly) is concerned with the now. It creates data in real time to answer questions of the very recent past.
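
The harvest-and-discard step described here can be sketched as a simple streaming filter in Python. This is a minimal illustration, not the project's actual pipeline: the tweet records are plain dictionaries with a hypothetical "coordinates" field holding (longitude, latitude) or None, and the bounding box for the study area is a rough, illustrative set of numbers rather than anything taken from the research.

```python
from typing import Dict, Iterable, Iterator, Tuple

# (west, south, east, north) in decimal degrees.
BoundingBox = Tuple[float, float, float, float]

# Rough illustrative box around the UK; an assumption for this sketch only.
UK_BOX: BoundingBox = (-8.7, 49.8, 1.8, 60.9)


def in_box(lon: float, lat: float, box: BoundingBox) -> bool:
    """Return True if the point lies inside the bounding box."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def filter_geocoded(tweets: Iterable[Dict], box: BoundingBox) -> Iterator[Dict]:
    """Yield only tweets that carry coordinates falling inside the box.

    Tweets without a geo-reference are discarded, as are those posted
    from outside the study area.
    """
    for tweet in tweets:
        coords = tweet.get("coordinates")
        if coords is None:
            continue
        lon, lat = coords
        if in_box(lon, lat, box):
            yield tweet


if __name__ == "__main__":
    sample = [
        {"text": "london", "coordinates": (-0.13, 51.5)},
        {"text": "new york", "coordinates": (-74.0, 40.7)},
        {"text": "no location", "coordinates": None},
    ]
    for kept in filter_geocoded(sample, UK_BOX):
        print(kept["text"])
```

Because the filter is a generator, each tweet is examined once and dropped immediately if it does not qualify. That suits work concerned with the now, but it also means that whatever is discarded at this stage is not available to a later researcher with different questions.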

Researchers working in 2044 and interested in 2014 may well be able to re-use this particular bespoke dataset (assuming it is preserved – a different matter of research data management, for another post sometime). However, they may equally well want to ask completely different questions, and so need data prepared in a quite different way. Right now, the future of the vast ocean of past tweets is not certain; and so it is not clear whether the scholar of 2044 will be able to create their own bespoke subset of data from the archive. The Library of Congress, to be sure, is receiving an archive of data from Twitter; but the access arrangements for this data are not clear, and at present there is no access at all. So, in the same way that historians need to take some ownership of the future of the archived web, we need to become much more concerned about the future of social media: the primary sources that our graduate students, and their graduate students in turn, will need to work with two generations down the line.

Certainly, historians have always been used to working around and across the gaps in the historical record; it’s part of the basic skillset, to deal with the fragmentary survival of the record. But there is right now a moment in which major strategic decisions are to be made about that survival, and historians need to make themselves heard.

[This post also appears on the IHR Digital History Seminar blog.]

Religion, social media and the web archive

Late last year I was delighted to be invited to be one of four keynote speakers at a workshop on religion and social media at the International AAAI Conference on Web and Social Media in Oxford in May. Here are some initial thoughts on what I intend to say.

There has been an interesting upswing recently in scholarly interest in the ways in which religious people, and the organisations in which they gather together, represent themselves and communicate with others on social media. However, this work has been conducted relatively independently from the emerging body of scholarship on the archived web.

There are some reasons for this. First is the fact that much of the scholarship on social media tends to be focussed very firmly on the present. As such, data tends to be gathered directly from social media platforms “to order”, to match the particular research questions in view, and does not engage the various web archives that are in existence, whether at national libraries or the Internet Archive.

The second reason (which may indeed be the more important) is that traditional web archiving has limited success in archiving social media content. There are several well-documented reasons for this, not least the significant technical difficulties in capturing the content as it is presented in user interfaces such as that for Twitter or YouTube. Also, the data gathered is wrapped up in its presentation layer, rather than being neatly organised as a dataset for analysis. Aside from these technical challenges, the very social nature of social media – with multiple content creators co-existing and interacting on the same platform – adds considerable complexity to the task of the web archivist of determining which content can be archived under existing legal deposit frameworks.

So much for the reasons; but this gap between social media research and the archived web needs to be closed, because part of the story is missed. If we want to understand the evolution of the engagement of churches with social media, then we need to understand the ways in which traditional church websites integrated social media content within themselves, and from what point in time. As well as this, we need to be able to understand the content to which social media users were referring and linking – content which will increasingly often be found only in web archives as it disappears from the live web.

In Oxford, I shall be presenting some small case studies in the development of the web and social media presence of local churches, individuals and national church bodies in England and in Ireland. How quickly did churches begin to integrate their social media channels with their websites – which is to ask, at which point did social media become central to their communication strategies? This work is enabled by data made available by the British Library which covers the period from 1996 until 2013, the period in which social media grew from nothing to the prominence it now holds.

[Updated, 5 June 2015: here are the slides: http://www.slideshare.net/pj_webster/slideshelf]