Slow scholarship and fast blogging

Recently there has been some debate on whether academic blogging is good for you; part of a wider debate about the speed and pressure of contemporary academic life. (You can get a sense of the debate in the articles here and here.) Some of this is prompted by the widely-circulated Manifesto for Slow Scholarship, which points to the supposedly inevitable superficiality of academic interaction in social media channels. It calls for a return to a more leisurely, measured, considered mode of thinking and writing, that produces writing that is fully baked. (There are some sage comments on the manifesto at large on CelebYouth.org.)

This month marks the second birthday of this blog (at least in its current form and location). And so it seemed a good time to look back and see whether there is any evidence here of ‘fast scholarship’ which has been too fast. After a lapse of two years, are there posts here that in retrospect I might wish not to have published ? If so, there would be some justice in the charge.

In that two-year period, I counted up some 74 individual posts. Some of these were reports of work that was appearing in print, or in other outlets, including extracts. Some were explicitly partial and forward looking, such as this post inviting comment on an abstract for a forthcoming conference. These have a natural shelf life.

Along with these, there are perhaps 45-50 posts which have the character of an essay: thinking that had not previously been published, and which were an expression of a reasonably settled view. How have these fared ?

Some which contained comment on live issues at a point in time have been overtaken by events and changing circumstances, such that they speak to issues that have either been settled, or have transmuted into something different. An example is this post on the Church of England and women bishops, and this, on disestablishment. Also in this category are various posts on the policy environment for Open Access in the humanities, in which policy statements from government and funders have come thick and fast. Others are in the character of reviews of fast-developing web services, such as Google Scholar Updates. That said, I think these remain reasonable and cogent points to have made at that time, and so I don’t intend to remove them.

But what of the others ? There are areas in which my thinking has deepened since the first time I posted about them. But (crucially) that growth in thought has not been away from the initial post, but deeper and wider in the same soil. This is indeed what one would hope would happen – the act of first essaying something here is the stimulus to further thought. And so, I don’t think there are any posts here which I now wish were not here, and not in the archived version in the UK Web Archive. From the evidence of this blog, at least, there is no contradiction between slow scholarship and fast blogging.

 

 

The ethics of search filtering and big data: who decides ?

[Reflecting on discussions at the recent UK Internet Policy Forum, this post argues that societies as moral communities need to take a greater share in the decision-making about controversial issues on the web, such as search filtering and the use of open data. It won’t do to expect tech companies and data collectors to settle questions of ethics.]

Last week I was part of the large and engaged audience at the UK Internet Policy Forum meeting, convened by Nominet. The theme was ‘the Open Internet and the Digital Economy’, and the sessions I attended were on filtering and archiving, and on the uses of Big Data. And the two were bound together by a common underlying theme.

That theme was the relative responsibilities of tech providers, end users and government (and regulators, and legislators) to solve difficult issues of principle: of what should (and should not) be available through search; and which data about persons should truly be regarded as personal, and how they should be used.

On search: last autumn there was a wave of public, and then political concern about the risk of child pornography being available via search engine results. Something Should Be Done, it was said. But the issue – child pornography – was so emotive, and legally so clear-cut, that important distinctions were not clearly articulated. The production and distribution of images of this kind would clearly be in contravention of the law, even if no-one were ever to view them. And a recurring theme during the day was that these cases were (relatively) straightforward – if someone shows up with a court order, search engines will remove that content from their results, for all users; so will the British Library remove archived versions of that content from the UK Legal Deposit Web Archive.

But there are several classes of other web content about which no court order could be obtained. Content may well directly or indirectly cause harm to those who view it. But because that chain of causation is so dependent on context and so individual, no parliament could legislate in advance to stop the harm occurring, and no algorithm could hope to predict that harm would be caused. I myself am not harmed by a site that provides instructions on how to take one’s own life; but others may well be. There is also another broad category of content which causes no immediate and directly attributable harm, but might in the longer term conduce to a change in behaviour (violent movies, for instance). There is also content which may well cause distress or offence (but not harm); on religious grounds, say. No search provider can be expected to intuit which elements of this content should be removed entirely from search, or flagged to end users as the kind of thing they might not want to see.

These decisions need to be taken at a higher level and in more general terms. They depend on the existence of the kind of moral consensus which was clearly visible at earlier times in British history, but which has become weakened if not entirely destroyed since the ‘permissive’ legislation of the Sixties. The system of theatre censorship was abolished in the UK in 1968 because it had become obvious that there was no public consensus that it was necessary or desirable. A similar story could be told about the decriminalisation of male homosexuality in 1967, or the reform of the law on blasphemy in 2008. As Dave Coplin of Microsoft put it, we need to decide collectively what kind of society we want; once we know that, we can legislate for it, and the technology will follow.

The second session revolved around the issue of big data and privacy. Much can be dealt with by getting the nature of informed consent correct, although it is hard to know what ‘informed’ means; it is difficult to imagine in advance all the possible uses to which data might be put, in order both to put and to answer the question ‘Do you consent?’.

But once again, the issues are wider than this, and it isn’t enough to declare that privacy must come first, as if this settled the issue. As Gilad Rosner suggested, the notion of personal data is not stable over time, or consistent between cultures. The terms of use of each of the world’s web archives are different, because different cultures have privileged different types of data as being ‘private’ or ‘personal’ or ‘sensitive’. Some cultures focus more on data about one’s health, or sexuality, or physical location, or travel, or mobile phone usage, or shopping patterns, or trade union membership, or religious affiliation, or postal address, or voting record and political party membership, or disability. None of these categories is self-evidently more or less sensitive than any of the others, and – again – these are decisions that need to be determined by society at large.

Tech companies and data collectors have responsibilities – to be transparent about the data they do have, and to co-operate quickly with law enforcement. They also must be part of the public conversation about where all these lines should be drawn, because public debate will never spontaneously anticipate all the possible use cases which need to be taken into account. In this we need their help. But ultimately, the decisions about what we do and don’t want must rest with us, collectively.

Introducing Web Archives for Historians

It was a great pleasure last week, after several months, to be able to unveil Web Archives for Historians, a joint project with the excellent Ian Milligan of the University of Waterloo.

The premise is simple. We’re looking to crowd-source a bibliography of research and writing by historians who use or think about the making or use of web archives. Here’s what the site has to say:

“We want to know about works written by historians covering topics such as: (a) reflections on the need for web preservation, and its current state in different countries and globally as a whole; (b) how historians could, should or should not use web archives; (c) examples of actual uses of web archives as primary sources.”

Ian and I had been struck by just how few historians we knew of who were beginning to use web archives as primary sources, and how little there has been written on the topic. We aimed to provide a resource for historians who are getting interested in the topic, to publicise their work and find that of others.

It can include formal research articles or book chapters, but also substantial blog posts and conference papers, which we think reflects the diverse ways in which this type of work is likely to be communicated.

So: please do submit a title, or view the bibliography to date (which is shared on a Creative Commons basis). You can also sign up to express a general interest in the area. These details won’t be shared publicly, but you might just occasionally hear by email of interesting developments as and when we hear of them.

You can also find the project on Twitter @HistWebArchives

In defence of pseudonymity

Pseudonymity has had a bad press recently. A moral consensus seems to have formed that there is a problem not merely with behaving badly online while not under your real name, but with adopting a pseudonym at all. Pseudonymous authors “hide” behind their noms de plume; they lack the courage of their convictions; they are in some sense cowardly.

I don’t want here to get into the powerful imperatives of self-preservation that make pseudonymous writing a necessity when resisting a tyrannical government. I’d like to explore the particular reason why I myself blog elsewhere, and tweet, under a pseudonym (which I am clearly not about to disclose here).

I write for a living, more or less. I publish academic works, and blog here and elsewhere on the areas in which I am either directly professionally concerned, or on those subjects in which I am expert enough to make some observations. As more and more historians start “doing history in public”, this hybrid model of communicating what we as scholars do will become more and more important. And, as more and more of the web is routinely archived by the Internet Archive or the UK Web Archive, all of that communication which might previously have happened in person or in conferences, will now persist in the digital record. And since all these various utterances are linked together in various ways, such as Google Authorship, it will become easier for readers to trawl back through them all, and to put them together.

Given this, it is increasingly difficult to keep open a space in which to express an opinion just as a citizen without it becoming part of a professional profile. There are many issues in contemporary life on which I have views, but without any particular expertise. I’ve written elsewhere on the reasons I just write, and some of those thoughts are clarified in my own mind by the discipline of putting them online. And so my “other blog” gives the space to work out those off-piste ideas without them becoming mixed with my more “professional” writing.

And (incidentally) this is why I am not a “public intellectual”. The concept, at least in the UK, seems to involve the bringing to bear of a general intelligence, honed in one field, to matters of more general interest. Stefan Collini and others have already pointed out the tensions in the role. I have no view on whether or not the specific professional reputation of (say) David Starkey in relation to Tudor England is compromised by taking part in The Moral Maze. My point is simply that maintaining a separation between professional and pseudonymous selves in public means the question does not arise.

Reading old news in the web archive, distantly

One of the defining moments of Rowan Williams’ time as archbishop of Canterbury was the public reaction to his lecture in February 2008 on the interaction between English family law and Islamic shari’a law. As well as focussing attention on real and persistent issues of the interaction of secular law and religious practice, it also prompted much comment on the place of the Church of England in public life, the role of the archbishop, and on Williams personally. I tried to record a sample of the discussion in an earlier post.

Of course, a great deal of the media firestorm happened online. I want to take the episode as an example of the types of analysis that the systematic archiving of the web now makes possible: a new kind of what Franco Moretti called ‘distant reading.’

The British Library holds a copy of the holdings of the Internet Archive for the .uk top level domain for the period 1996-2010. One of the secondary datasets that the Library has made available is the Host Link Graph. With this data, it’s possible to begin examining how different parts of the UK web space referred to others. Which hosts linked to others, and from when until when ?

This graph shows the total number of unique hosts that were found linking at least once to archbishopofcanterbury.org in each year.

[Figure: bar chart of the number of unique hosts linking to archbishopofcanterbury.org in each year.]

My hypothesis was that there should be more unique hosts linking to the archbishop’s site after February 2008, which is by and large borne out. The figure for 2008 is nearly 50% higher than for the previous year, and nearly 25% higher than the previous peak in 2004. This would suggest that a significant number of hosts that had not previously linked to the Canterbury site did so in 2008, quite possibly in reaction to the shari’a story.

What I had not expected to see was the total number fall back to trend in 2009 and 2010. I had rather expected to see the absolute numbers rise in 2008 and then stay at similar levels – that is, to see the links persist. The drop suggests that either large numbers of sites were revised to remove links that were thought to be ‘ephemeral’ (that is to say, actively removed), or that there is a more general effect in that certain types of “news” content are not (in web archivist terms) self-archiving.

The next step is for me to look in detail at those domains that linked only once to Canterbury, in 2008, and to examine these questions in a more qualitative way. Here then is distant reading leading to close reading.

Method
You can download the data, which is in the public domain, from here. Be sure to have plenty of hard disk space, as when unzipped the data is more than 120GB. The data looks like this:

2010 | churchtimes.co.uk | archbishopofcanterbury.org | 20

which tells you that in 2010, the Internet Archive captured 20 individual resources (usually, although not always, “pages”) in the Church Times site that linked to the archbishop’s site. My poor old laptop spent a whole night running through the dataset and extracting all the instances of the string “archbishopofcanterbury.org”.
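That overnight extraction can be sketched in a few lines of Python. This is only an illustration, not the script actually used: it assumes every line is laid out exactly like the example above (four fields separated by " | "), and the names LinkRecord, parse_line and links_to are invented here. The real files may well use a different delimiter, so the split would need adjusting.

```python
from typing import Iterable, Iterator, NamedTuple


class LinkRecord(NamedTuple):
    """One row of the link graph (field names are illustrative)."""
    year: int
    source_host: str
    target_host: str
    link_count: int


def parse_line(line: str) -> LinkRecord:
    """Split one 'year | source | target | count' line into its four fields."""
    year, source, target, count = (field.strip() for field in line.split("|"))
    return LinkRecord(int(year), source, target, int(count))


def links_to(lines: Iterable[str], target: str) -> Iterator[LinkRecord]:
    """Yield the records that link to `target`, reading one line at a time.

    Streaming matters here: the unzipped data is far too large to hold
    in memory on an ordinary laptop.
    """
    for line in lines:
        if target not in line:      # cheap string test before any parsing
            continue
        try:
            record = parse_line(line)
        except ValueError:          # wrong number of fields, or non-numeric values
            continue
        if target in record.target_host:
            yield record
```

In use, one would open the unzipped file and pass the file object straight in, for instance `links_to(open(path, encoding="utf-8"), "archbishopofcanterbury.org")`, where `path` is wherever the download was saved.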

Then I looked at the total numbers of unique hosts linking to the archbishop’s site in each year. In order to do so, I:

(i) stripped out those results which were outward links from a small number of captures of the archbishop’s site itself.

(ii) allowed for the occasions when the IA had captured the same host twice in a single year (which does not occur consistently from year to year.)

(iii) did not aggregate results for hosts that were part of a larger domain. This would have been easy to spot in the case of the larger media organisations such as the Guardian, which has multiple hosts (society.guardian.co.uk, education.guardian.co.uk, etc.). However, it is much harder to do reliably for all such cases without examining individual archived instances, which was not possible at this scale.

Assumptions

(i) that a host “abc.co.uk” held the same content as “www.abc.co.uk”.

(ii) that the Internet Archive were no more likely to miss hosts that linked to the Canterbury site than ones that did not – i.e., if there are gaps in what the Internet Archive found, there is no reason to suppose that they systematically skew this particular analysis.
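For anyone wanting to repeat the count, the steps and assumptions above might be sketched as follows. This is a hedged reconstruction rather than the code actually run: it assumes the records have already been reduced to (year, source host, target host) tuples, and the function names and the sample hostnames in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

TARGET = "archbishopofcanterbury.org"


def normalise(host: str) -> str:
    """Treat 'www.abc.co.uk' and 'abc.co.uk' as the same host (assumption (i))."""
    host = host.strip().lower()
    return host[4:] if host.startswith("www.") else host


def unique_linking_hosts(
    records: Iterable[Tuple[int, str, str]], target: str = TARGET
) -> Dict[int, Set[str]]:
    """Map each year to the set of distinct hosts found linking to `target`."""
    hosts_by_year: Dict[int, Set[str]] = defaultdict(set)
    for year, source, dest in records:
        source, dest = normalise(source), normalise(dest)
        if dest != target:
            continue                 # not a link to the target (this also drops its outward links)
        if source == target:
            continue                 # step (i): ignore the site linking to itself
        # Step (ii): a set collapses repeat captures of the same host in a year.
        # Step (iii): subdomains are deliberately left unaggregated.
        hosts_by_year[year].add(source)
    return dict(hosts_by_year)


def only_in_year(hosts_by_year: Dict[int, Set[str]], year: int) -> Set[str]:
    """Hosts that linked in `year` and in no other year."""
    elsewhere = set().union(
        *(hosts for y, hosts in hosts_by_year.items() if y != year)
    )
    return hosts_by_year.get(year, set()) - elsewhere
```

The yearly totals for a chart are then `{year: len(hosts) for year, hosts in result.items()}`, and `only_in_year(result, 2008)` gives a candidate list of hosts for closer, qualitative reading.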

Web archives: a new class of primary source for historians ?

On June 11th I gave a short paper at the Digital History seminar at the Institute of Historical Research, looking at the implications of web archives for historical practice, and introducing some of the work I’ve been doing (at the British Library) with the JISC-funded Analytical Access to the Domain Dark Archive project. It picked up on themes in a previous post here.

There is also an audio version here at HistorySpot along with the second paper in the session, given by Richard Deswarte.

The abstract (for the two papers together) reads:

“When viewed in historical context, the speed at which the world wide web has become fundamental to the exchange of information is perhaps unprecedented. The Internet Archive began its work in archiving the web in 1996, and since then national libraries and other memory institutions have followed suit in archiving the web along national or thematic lines. However, whilst scholars of the web as a system have been quick to embrace archived web materials as the stuff of their scholarship, historians have been slower in thinking through the nature and possible uses of a new class of primary source.

“In April 2013 the six legal deposit libraries for the UK were granted powers to archive the whole of the UK web domain, in parallel with the historic right of legal deposit for print. As such, over time there will be a near-comprehensive archive of the UK web available for historical analysis, which will grow and grow in value as the span of time it covers lengthens. This paper introduces the JISC-funded AADDA (Analytical Access to the Domain Dark Archive) project. Led by the Institute of Historical Research (IHR) in partnership with the British Library and the University of Cambridge, AADDA seeks to demonstrate the value of longitudinal web archives by means of the JISC UK Web Domain Dataset. This dataset includes the holdings of the Internet Archive for the UK for the period 1996-2010, purchased by the JISC and placed in the care of the British Library. The project has brought together scholars from the humanities and social sciences in order to begin to imagine what scholarly enquiry with assets such as these would look like.”

What use is a personal tweet archive ?

A little while ago I wrote a post about the need to plan for archiving the digital “papers” of historians. In that post I talked about research data (what we used to call “notes”); about the systems that form the bridge between that data and the writing process; and about written outputs themselves, and their various iterations. It looked forward to a time when all these digital objects, in multiple formats but from one mind, are available to future students of the way the discipline has developed.

What that post neglected was data about the way I publicise my work. Perhaps one of the reasons we’ve been slow to think about this is that, at one time, most academics didn’t need to. Apart from giving papers at gatherings of the learned, the task of publicising one’s work belonged to the publisher. And if one’s publisher was the right one, then the work would inevitably end up in the hands of the small group of people who needed to know about it. And whilst the media don is not a new phenomenon, most historians might have thought such self-publicity outside the academy something of an embarrassment, even rather vulgar.

How times change. Universities are training their staff in dealing with the traditional media and in the most effective way of using social media. And this opens up a new category of data that ought to be archived, if only to understand how the push for ‘impact’ actually played out in these early years. And some of it is being archived. The Library of Congress are archiving every tweet, although it isn’t yet clear how that archive may be made available for use. The UK Web Archive, along with other national web archives, have been archiving selected blogs (including this one) for several years, and the EU-funded BlogForever project is looking to join those projects up. But this approach, valuable though it is, separates the content from the author, and from the rest of their digital archive. Whilst that link might be retrievable at a higher discovery layer, something important is still lost.

But now the helpful folk at Twitter, in a move that ought to be applauded, have made it very quick and easy to download an archive of one’s own tweets, right back to the beginning. And so I did: 1682 tweets, over 14.5 months. But what to do with it ?

Straight away, scrolling through a long CSV file starts to tell the story of the making of other things: the first retweet of someone else’s work which was subsequently to influence my own; the first traces of an idea, or even of a question I was beginning to ask, which spawned a blog post, and then a paper. I also find that I shared at least one link in more than two thirds of my tweets, which sounds public-spirited until I add that a good proportion were my own posts. I can start mining the data for key terms and themes, and how they ebbed and flowed.
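A first pass at that kind of mining needs very little code. The sketch below is illustrative only: it assumes the downloaded CSV has a column named "text" holding the body of each tweet (worth checking against the header row of the actual file), and the function names are invented for the purpose.

```python
import csv
import re
from collections import Counter
from typing import Iterable, List, Tuple

URL_PATTERN = re.compile(r"https?://\S+")
WORD_PATTERN = re.compile(r"[a-z']{4,}")   # crude: runs of four or more letters


def read_tweets(csv_lines: Iterable[str], column: str = "text") -> List[str]:
    """Pull the tweet text out of each CSV row (the column name is an assumption)."""
    return [row[column] for row in csv.DictReader(csv_lines)]


def link_share(tweets: Iterable[str]) -> float:
    """Fraction of tweets that contain at least one URL."""
    tweets = list(tweets)
    if not tweets:
        return 0.0
    with_link = sum(1 for text in tweets if URL_PATTERN.search(text))
    return with_link / len(tweets)


def top_terms(tweets: Iterable[str], n: int = 10) -> List[Tuple[str, int]]:
    """Most frequent words across all tweets, ignoring URLs and short words."""
    counts: Counter = Counter()
    for text in tweets:
        without_urls = URL_PATTERN.sub(" ", text.lower())
        counts.update(WORD_PATTERN.findall(without_urls))
    return counts.most_common(n)
```

Passing an open file works directly, as in `read_tweets(open("tweets.csv", encoding="utf-8", newline=""))`, where the filename is whatever the download was saved as. Watching terms ebb and flow over time would mean grouping by a date column as well, which is left out here.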

It would be useful if there were a way to keep this data fresh, of course, to avoid going back to Twitter for a new download every so often. And, thanks to @mhawksey, there is a simple way of doing this, using Google Drive. Martin explains all here, with a handy video set-up guide.

And so I now have a cloud-based archive of my tweets, complete with a basic search and browse web interface. This is now a lazy man’s look-up of old tweets and the resources they pointed to, searchable by handle, hashtag or key term.

But perhaps this is something about which most people are lazy. Social media provides us with an overwhelming stream of quite-interesting things, in amongst which are nuggets of gold. Those nuggets I can manage in the old way, by recording them properly, perhaps in a bibliography. I might even read them, one day. But the quite-interesting stuff, whilst being too much ever to record properly, will probably remain quite interesting. And so this provides a middle way between formal curation of a webliography and just searching the live web (which assumes I can remember enough about what I’m looking for.)

Might this archive now change my future tweeting ? Early days to judge perhaps. But I think it may, since I may now retweet and share in preference to using favourites, in order to get a link to a resource into the archive. I can also imagine starting to use personal hashtags, as a way of structuring my own archive at the same time as I tweet. Real-time curation perhaps ?

And I might share it too. Since this is now unambiguously my own data, rather than Twitter’s, I can licence it for reuse by others in larger corpora for analysis. Imagine a pooled archive of the tweets of many historians. Now that would be interesting.

Religion, politics and law in contemporary Britain: a web archive

[This is an expanded version of a post first published in the UK Web Archive blog.]

It has been over two years in the making, but I am delighted to be able to say that my own special collection in the UK Web Archive is now online.

UKWA (for which I am engagement and liaison lead, based at the British Library) collects and preserves websites of scholarly and cultural importance for the UK web domain. Already UKWA collects some 11,000 sites, and has more than 50,000 instances in total, with series of snapshots of some sites going back the best part of a decade. That’s a lot of data, and so one of the ways into the archive is by means of the special collection, of sites on a particular theme.

A couple of years ago, long before coming to the BL, I joined a project at the Library which brought together a group of scholars to guest-curate special collections on our research interests. I had become interested in the sharpening of the terms of debate about the place of religion in British public life, particularly since 9/11 and the London bombings in 2005. I’ve long been interested in public debate about church and state; but until relatively recently this happened by means of the print press, public oratory, ephemeral publication and the broadcast media. It struck me that a good deal of this debate had already moved online, and so new ways of capturing and preserving it were going to be needed. And so, the ‘politics of religion collection’ (as it was then known) was born. (See these posts on my progress.)

I fairly soon realised why I’m not an archivist, since all sorts of unfamiliar questions hove into view. When archiving the web, what is the base unit ? A whole domain, such as www.bbc.co.uk ? Or a single URL ? Several sites, like that of the National Secular Society or the Christian Institute, were central to my concerns, and so could be included whole. But what does one do with a single post on a PR blog about the handling of the sharia law row by Rowan Williams and his staff ? In fact, the collection is a mixture of whole domains and individual directories or pages from larger sites; an uneasy compromise, but a necessary one.

Also (and I may as well come straight out with it), the collection is selective, and thus in a real sense subjective. As a watcher of contemporary religious politics, against the backdrop of recent history, my impression is that the place of religious ideas, symbols and organisations in public life is at its most contested for decades. Historians are traditionally wary of assessing the significance of present trends, since it leaves hostages to fortune and later events. Yet, all archival choices from a pool of material not defined in advance by provenance involve some judgements as to significance; and historians are as well suited as any to make those judgements. And so I have put the collection together now to enable future historians to begin to answer the questions which I anticipate will be significant. (See an older post on why I think historians should engage with this way of working.)

There were other issues. Were I the archivist for a particular organisation, I’d have no problem with getting permission to add material to my archive: everything produced in-house would be in view. The problem for web archiving is that we’re dealing with other people’s copyright work, and so an individual permission is needed for each site. I have a long list of sites which I would dearly love to add to the collection, but for which (for various reasons) we’ve had no response. So, if you are the owner of Protest the Pope, or Holy Redundant, or Christians in Politics, please get in touch. For now, even if the collection cannot be anything like comprehensive, I do hope that it is at least coherent.

There are particular strengths, and some gaps. It includes many campaigning organisations, both secularist and religious, and is heavy on the conservative Christian groups about which I myself know most. It is very light on non-Christian faiths, since I know the field much less well. It is still very much open, however, and so suggestions of sites that ought to be included are very welcome, via this blog or at the UKWA Nominate a Site page.

What can you do with it ? For now, there is a simple browse function; and the collection can be searched on its own. And over time, all sorts of uses will present themselves, which we can’t currently imagine. But the data is there: a growing longitudinal series of timed instances of websites, identified as thematically related; that is to say, an archive.

Why historians should care about web archiving

Someone said to me at a conference recently (not his exact words), “if we can’t get historians interested in web archives, then who can we reach ?” But so far, there hasn’t been much visible engagement between contemporary historians and web archives, even though those archives are now well established at national memory institutions such as the Library of Congress or the British Library. [Full disclosure: the latter employs me, but this post represents a personal view, not the Library’s.] And as an historian who has been involved with web archives since before coming to the BL, I think this needs to change.

The evidence is mounting of how vulnerable the web actually is. One study found that 11% of content shared via social media will have disappeared a year later, and another 7% each year after that – a startling rate. And since there was a time lag between the migration of the archival record into a digital-only mode and the establishment of web archives, there is already a large hole in the record from perhaps the mid-nineties to the mid-noughties. A recent post of mine over at the UK Web Archive blog showed just how significant are some of the sites that now exist only in web archives; and that’s only the ones the UKWA managed to capture in time. We can only guess at what is now lost forever.

So, in twenty or thirty years’ time, historians of the very late twentieth century will have reason to regret that no-one thought to keep their primary sources safe for them. But there is another problem. It is a brave historian who writes on the very recent past, a remote subject indeed; I myself wrote an article in 2004 that extended up to 1990, and not without some unease about the hostages to scholarly fortune it gave. And so most of the historians who have the greatest personal stake in archiving the web right now haven’t yet entered the profession. I would argue that historians are uniquely well-placed to view the present in relation to the past, and thus to anticipate those aspects of the present for which there is most need for a record. But it would take a significant change in culture for historians working now to start taking a hand in preserving sources for our successors.

“But this isn’t my job”, the response might be. “Surely this is what archivists are for ? (It always used to be.)” Granted, in a pre-digital world, institutional archivists in government, civil society, the churches, concentrated on capturing unpublished materials produced in-house, took in those personal archives that were offered to them, and left the copyright libraries to pick up books and journals. If the ephemeral stuff in the cracks didn’t survive, then such was life. Now, the volume of words is so much greater, and the means of disseminating them so dispersed, that archivists as a profession (already an undervalued and underpaid one, I might add) can’t hope even to see, let alone arrange to capture everything of note.

So: we need a new model of archival curation, based on a partnership between archivists, scholars and the public. The technical means are there; it simply needs a new form of engagement, and we historians can help make it happen.

A Heisenberg Principle of web archiving ?

Whatever it means to real scientists, the famous ‘uncertainty principle’ of Werner Heisenberg is sometimes popularly taken to mean that it is impossible closely to observe something without in some way altering it. It’s also a conundrum that has faced anthropologists when observing cultures far removed from their own: how far does the consciousness of being observed alter the behaviour of the subject ?

I’ve been publishing in print in the traditional way for some years now, and everyone knows that books are (in theory) permanent, that they find their way into libraries; and so one writes conscious that the words cannot be unwritten. Writing for the web, however, has had a more transient aesthetic: I can write with the freedom that comes from knowing that (in a site I control) I can retrospectively edit at will, should I choose to. There are good scholarly reasons not to, to do with making my work reliably citable; but in the final analysis I am not bound by them.

So far, the visibility of web archiving by national memory institutions is not high. In addition, if the UK Web Archive considers a site important enough to archive, then it must gain explicit permission; and by no means all website owners give that consent. This blog is already being archived by the UK Web Archive (last crawl in April 2012); but had I been at all concerned about the things I write having a permanent existence, then I could have withheld permission.

On the horizon is a major piece of legislation that could subtly but importantly change things: the Legal Deposit Libraries (Non-print Works) Regulations 2013 (see the most recent public consultations here.) As and when these successfully negotiate the passage through Parliament, any website in the .uk domain could be archived for posterity without the explicit consent of the owner.

The change in the law in itself isn’t my main point, however: the effect of increasing consciousness of it is. Put simply: will some words that might have been written in 2012 not be written in 2014 because the author was conscious that they could not later be retracted ? I think it likely. Would it be a ‘bad thing’ ? I don’t suppose we know yet; but we ought to be thinking about it.