The ethics of search filtering and big data: who decides?

[Reflecting on discussions at the recent UK Internet Policy Forum, this post argues that societies as moral communities need to take a greater share in the decision-making about controversial issues on the web, such as search filtering and the use of open data. It won’t do to expect tech companies and data collectors to settle questions of ethics.]

Last week I was part of the large and engaged audience at the UK Internet Policy Forum meeting, convened by Nominet. The theme was ‘the Open Internet and the Digital Economy’, and the sessions I attended were on filtering and archiving, and on the uses of Big Data. And the two were bound together by a common underlying theme.

That theme was the relative responsibilities of tech providers, end users and government (and regulators, and legislators) to solve difficult issues of principle: of what should (and should not) be available through search; and which data about persons should truly be regarded as personal, and how they should be used.

On search: last autumn there was a wave of public, and then political concern about the risk of child pornography being available via search engine results. Something Should Be Done, it was said. But the issue – child pornography – was so emotive, and legally so clear-cut, that important distinctions were not clearly articulated. The production and distribution of images of this kind would clearly be in contravention of the law, even if no-one were ever to view them. And a recurring theme during the day was that these cases were (relatively) straightforward – if someone shows up with a court order, search engines will remove that content from their results, for all users; so will the British Library remove archived versions of that content from the UK Legal Deposit Web Archive.

But there are several classes of other web content about which no court order could be obtained. Content may well directly or indirectly cause harm to those who view it. But because that chain of causation is so dependent on context and so individual, no parliament could legislate in advance to stop the harm occurring, and no algorithm could hope to predict that harm would be caused. I myself am not harmed by a site that provides instructions on how to take one’s own life; but others may well be. There is also another broad category of content which causes no immediate and directly attributable harm, but might in the longer term conduce to a change in behaviour (violent movies, for instance). There is also content which may well cause distress or offence (but not harm); on religious grounds, say. No search provider can be expected to intuit which elements of this content should be removed entirely from search, or suggest to end users as the kind of thing they might not want to see.

These decisions need to be taken at a higher level and in more general terms. Doing so depends on the existence of the kind of moral consensus which was clearly visible at earlier times in British history, but which has been weakened if not entirely destroyed since the ‘permissive’ legislation of the Sixties. The system of theatre censorship was abolished in the UK in 1968 because it had become obvious that there was no public consensus that it was necessary or desirable. A similar story could be told about the decriminalisation of male homosexuality in 1967, or the reform of the law on blasphemy in 2008. As Dave Coplin of Microsoft put it, we need to decide collectively what kind of society we want; once we know that, we can legislate for it, and the technology will follow.

The second session revolved around the issue of big data and privacy. Much can be dealt with by getting the nature of informed consent correct, although it is hard to know what ‘informed’ means: it is difficult to imagine in advance all the possible uses to which data might be put, in order both to ask and to answer the question ‘Do you consent?’.

But once again, the issues are wider than this, and it isn’t enough to declare that privacy must come first, as if this settled the issue. As Gilad Rosner suggested, the notion of personal data is not stable over time, or consistent between cultures. The terms of use of each of the world’s web archives are different, because different cultures have privileged different types of data as being ‘private’ or ‘personal’ or ‘sensitive’. Some cultures focus more on data about one’s health, or sexuality, or physical location, or travel, or mobile phone usage, or shopping patterns, or trade union membership, or religious affiliation, or postal address, or voting record and political party membership, or disability. None of these categories is self-evidently more or less sensitive than any of the others, and – again – these are decisions that need to be determined by society at large.

Tech companies and data collectors have responsibilities – to be transparent about the data they do have, and to co-operate quickly with law enforcement. They also must be part of the public conversation about where all these lines should be drawn, because public debate will never spontaneously anticipate all the possible use cases which need to be taken into account. In this we need their help. But ultimately, the decisions about what we do and don’t want must rest with us, collectively.

Introducing Web Archives for Historians

It was a great pleasure last week, after several months, to be able to unveil Web Archives for Historians, a joint project with the excellent Ian Milligan of the University of Waterloo.

The premise is simple. We’re looking to crowd-source a bibliography of research and writing by historians who use or think about the making or use of web archives. Here’s what the site has to say:

“We want to know about works written by historians covering topics such as: (a) reflections on the need for web preservation, and its current state in different countries and globally as a whole; (b) how historians could, should or should not use web archives; (c) examples of actual uses of web archives as primary sources.”

Ian and I had been struck by just how few historians we knew of who were beginning to use web archives as primary sources, and how little had been written on the topic. We aim to provide a resource for historians who are getting interested in the area, helping them to publicise their work and to find that of others.

The bibliography can include formal research articles and book chapters, but also substantial blog posts and conference papers, which we think reflects the diverse ways in which this type of work is likely to be communicated.

So: please do submit a title, or view the bibliography to date (which is shared on a Creative Commons basis). You can also sign up to express a general interest in the area. These details won’t be shared publicly, but you might just occasionally hear by email of interesting developments as and when we hear of them.

You can also find the project on Twitter at @HistWebArchives.

Reading old news in the web archive, distantly

One of the defining moments of Rowan Williams’ time as archbishop of Canterbury was the public reaction to his lecture in February 2008 on the interaction between English family law and Islamic shari’a law. As well as focussing attention on real and persistent issues of the interaction of secular law and religious practice, it also prompted much comment on the place of the Church of England in public life, the role of the archbishop, and on Williams personally. I tried to record a sample of the discussion in an earlier post.

Of course, a great deal of the media firestorm happened online. I want to take the episode as an example of the types of analysis that the systematic archiving of the web now makes possible: a new kind of what Franco Moretti called ‘distant reading.’

The British Library holds a copy of the holdings of the Internet Archive for the .uk top level domain for the period 1996-2010. One of the secondary datasets that the Library has made available is the Host Link Graph. With this data, it’s possible to begin examining how different parts of the UK web space referred to others. Which hosts linked to others, and from when until when?

This graph shows the total number of unique hosts that were found linking at least once to archbishopofcanterbury.org in each year.

[Chart: unique hosts linking to archbishopofcanterbury.org, by year (bar chart)]

My hypothesis was that there should be more unique hosts linking to the archbishop’s site after February 2008, which is by and large borne out. The figure for 2008 is nearly 50% higher than for the previous year, and nearly 25% higher than the previous peak in 2004. This would suggest that a significant number of hosts that had not previously linked to the Canterbury site did so in 2008, quite possibly in reaction to the shari’a story.

What I had not expected to see was the total number fall back to trend in 2009 and 2010. I had rather expected to see the absolute numbers rise in 2008 and then stay at similar levels – that is, to see the links persist. The drop suggests either that large numbers of sites were revised to remove links that were thought to be ‘ephemeral’ (that is to say, the links were actively removed), or that there is a more general effect in that certain types of “news” content are not (in web archivist terms) self-archiving. [Update 02/07/2014: see comment below.]

The next step is for me to look in detail at those domains that linked only once to Canterbury, in 2008, and to examine these questions in a more qualitative way. Here then is distant reading leading to close reading.

Method
You can download the data, which is in the public domain, from here. Be sure to have plenty of hard disk space: unzipped, it is more than 120GB. The data looks like this:

2010 | churchtimes.co.uk | archbishopofcanterbury.org | 20

which tells you that in 2010, the Internet Archive captured 20 individual resources (usually, although not always, “pages”) in the Church Times site that linked to the archbishop’s site. My poor old laptop spent a whole night running through the dataset and extracting all the instances of the string “archbishopofcanterbury.org”.
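If you wanted to repeat that extraction step, something like the following minimal Python sketch would do it. The filenames are my own placeholders, and it assumes the unzipped data is a single pipe-delimited text file:

```python
# Keep every row of the Host Link Graph that mentions the target host.
# Filenames are hypothetical; adjust to wherever the unzipped data lives.
TARGET = "archbishopofcanterbury.org"

with open("host-link-graph.txt", encoding="utf-8") as src, \
        open("canterbury-links.txt", "w", encoding="utf-8") as out:
    for line in src:
        if TARGET in line:
            out.write(line)
```

This is a crude string match rather than a parse; the parsing happens in the next step.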

Then I looked at the total numbers of unique hosts linking to the archbishop’s site in each year. In order to do so (a rough sketch of the aggregation follows this list), I:

(i) stripped out those results which were outward links from a small number of captures of the archbishop’s site itself.

(ii) allowed for the occasions when the IA had captured the same host twice in a single year (which does not occur consistently from year to year).

(iii) did not aggregate results for hosts that were part of a larger domain. This would have been easy to spot in the case of the larger media organisations such as the Guardian, which has multiple hosts (society.guardian.co.uk, education.guardian.co.uk, etc.). However, it is much harder to do reliably for all such cases without examining individual archived instances, which was not possible at this scale.
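Taken together, those three decisions amount to a fairly small script. A minimal sketch, again with hypothetical filenames and with no normalisation of ‘www.’ variants or subdomains, might look like this:

```python
# Count unique hosts linking to the archbishop's site in each year.
# Outward links from the site itself are excluded (step i), and a set
# per year dedupes hosts captured more than once in that year (step ii).
from collections import defaultdict

TARGET = "archbishopofcanterbury.org"
hosts_by_year = defaultdict(set)

with open("canterbury-links.txt", encoding="utf-8") as src:
    for line in src:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue
        year, source, target, _count = parts
        if target == TARGET and source != TARGET:
            hosts_by_year[year].add(source)

for year in sorted(hosts_by_year):
    print(year, len(hosts_by_year[year]))
```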

Assumptions

(i) that a host “abc.co.uk” held the same content as “www.abc.co.uk”.

(ii) that the Internet Archive was no more likely to miss hosts that linked to the Canterbury site than ones that did not – i.e., if there are gaps in what the Internet Archive found, there is no reason to suppose that they systematically skew this particular analysis.

Book review: The Future of Scholarly Communication (Shorley and Jubb)

[This review appeared in the 24 July issue of Research Fortnight, and is reposted here by kind permission. For subscribers, it is also available here.]

Perhaps the one thing on which all the contributors to this volume could agree is that scholarly communication is changing, and quickly. As such, it is a brave publisher that commits to a collection such as this — in print alone, moreover. Such reflections risk being outdated before the ink dries.

The risk has been particularly acute in the last year, as policy announcements from government, funders, publishers and learned societies have come thick and fast as the implications of the Finch report, published in the summer of 2012, have been worked out. It’s a sign of this book’s lead time that it mentions Finch only twice, and briefly. That said, Michael Jubb, director of the Research Information Network, and Deborah Shorley, Scholarly Communications Adviser at Imperial College London, are to be congratulated for having assembled a collection that, even if it may not hold many surprises, is an excellent introduction to the issues. By and large, the contributions are clear and concise, and Jubb’s introduction is a model of lucidity and balance that would have merited publication in its own right as a summation of the current state of play.

As might be expected, there is much here about Open Access. Following Finch, the momentum towards making all publications stemming from publicly funded research free at the point of use is probably unstoppable. This necessitates a radical reconstruction of business models for publishers, and similarly fundamental change in working practices for scholars, journal editors and research libraries. Here Richard Bennett of Mendeley, the academic social network and reference manager recently acquired by Elsevier, gives the commercial publisher’s point of view, while Mike McGrath gives a journal editor’s perspective that is as pugnacious as Bennett’s is anodyne. Robert Kiley writes on research funders, with particular reference to the Wellcome Trust, where he is head of digital services. Together with Jubb’s introduction and Mark Brown’s contribution on research libraries, these pieces give a clear introduction to hotly contested issues.

There is welcome acknowledgement here that there are different forces at work in different disciplines, with STM being a good deal further on in implementing Open Access than the humanities. That said, all authors concentrate almost exclusively on the journal article, with little attention given to other formats, including the edited collection of essays, the textbook and — particularly crucial for the humanities — the monograph.

Thankfully, there’s more to scholarly communication than Open Access. The older linear process, in which research resulted in a single fixed publication disseminated to trusted repositories (libraries) that acted as the sole conduits of that work to scholars, is breaking down. Research is increasingly communicated while it is in progress, with users contributing to the data on which research is based at every stage.

Fiona Courage and Jane Harvell provide a case study of the interaction between humanists and social scientists and their data from the long-established Mass Observation Archive. The availability of data in itself is prompting creative thinking about the nature of the published output: here, John Wood writes on how the data on which an article is founded can increasingly be integrated with the text. And the need to manage access to research data is one of several factors prompting a widening of the traditional scope of the research library.

Besides the changing roles of libraries and publishers, social media is allowing scholars themselves to become more active in how their work is communicated. Ellen Collins, also of RIN, explores the use of social media as means of sharing and finding information about research in progress or when formally published, and indeed as a supplementary or even alternative method of publication, particularly when reaching out to non-traditional audiences.

Collins also argues that so far social media have mimicked existing patterns of communication rather than disrupting them. She’s one of several authors injecting a note of cold realism that balances the technophile utopianism that can creep into collections of this kind. Katie Anders and Liz Elvidge, for example, note that researchers’ incentives to communicate creatively remain weak and indirect in comparison to the brute need to publish or perish. Similarly, David Prosser observes that research communication continues to look rather traditional because the mechanisms by which scholarship is rewarded have not changed, and those imperatives still outweigh the need for communication.

This collection expertly outlines the key areas of flux and uncertainty in scholarly communication. Since many of the issues will only be settled by major interventions by governments and research funders, this volume makes only as many firm predictions as one could expect. However, readers in need of a map to the terrain could do much worse than to start here.

[The Future of Scholarly Communication, edited by Deborah Shorley and Michael Jubb, is published by Facet, at £49.95.]

Web archives: a new class of primary source for historians?

On June 11th I gave a short paper at the Digital History seminar at the Institute of Historical Research, looking at the implications of web archives for historical practice, and introducing some of the work I’ve been doing (at the British Library) with the JISC-funded Analytical Access to the Domain Dark Archive project. It picked up on themes in a previous post here.

There is also an audio version here at HistorySpot along with the second paper in the session, given by Richard Deswarte.

The abstract (for the two papers together) reads:

“When viewed in historical context, the speed at which the world wide web has become fundamental to the exchange of information is perhaps unprecedented. The Internet Archive began its work in archiving the web in 1996, and since then national libraries and other memory institutions have followed suit in archiving the web along national or thematic lines. However, whilst scholars of the web as a system have been quick to embrace archived web materials as the stuff of their scholarship, historians have been slower in thinking through the nature and possible uses of a new class of primary source.

“In April 2013 the six legal deposit libraries for the UK were granted powers to archive the whole of the UK web domain, in parallel with the historic right of legal deposit for print. As such, over time there will be a near-comprehensive archive of the UK web available for historical analysis, which will grow and grow in value as the span of time it covers lengthens. This paper introduces the JISC-funded AADDA (Analytical Access to the Domain Dark Archive) project. Led by the Institute of Historical Research (IHR) in partnership with the British Library and the University of Cambridge, AADDA seeks to demonstrate the value of longitudinal web archives by means of the JISC UK Web Domain Dataset. This dataset includes the holdings of the Internet Archive for the UK for the period 1996-2010, purchased by the JISC and placed in the care of the British Library. The project has brought together scholars from the humanities and social sciences in order to begin to imagine what scholarly enquiry with assets such as these would look like.”

Tidiness and reward: the British Evangelical Networks project

[The British Evangelical Networks project will create a crowd-sourced dataset of connections between twentieth-century evangelical ministers, their churches and the organisations that trained them and kept them connected. Here I argue that the project adopts an approach that can achieve what is beyond the capabilities of any single scholar. However, it will require participants to live dangerously, and embrace different approaches both to academic credit, and to tidiness.]

For a couple of years I’d been sitting on a good idea. Historians of British evangelicalism have for a long time had to rely on sources for a small number of well-known names. John Stott, for instance, has not one but two biographers, and a bibliographer to boot. But we know surprisingly little about the mass of evangelical ministers who served congregations; the foot-soldiers, as it were. There are some excellent studies of individual churches, but not nearly enough to begin to form anything like a national picture.

But what if we begin to trace the careers of evangelical ministers – from university through ministerial training to successive congregations? Who trained with whom, and where did they later serve together? Which were the evangelical congregations, and when did they start (or stop) being so? We could start to map evangelical strength in particular localities, and see how co-operation between evangelicals in different churches might have developed. If we could begin to reconstruct the membership of para- and inter-church organisations, from the diocesan evangelical unions (in the Church of England) to the Evangelical Alliance, what a resource there would be for understanding the ways in which evangelicals interacted, and sustained themselves. And what did evangelicalism look like when viewed across the whole of the UK? What were the exchanges of personnel between churches in England and Wales, say, or between Scotland and Northern Ireland?

But which single scholar could hope to complete such a task? None – but that need not stop it happening. Much of the data needed to trace all these networks is already in the possession of individual scholars, as well as librarians and archivists, and members of individual churches with an interest in their own ‘family history’. All that is needed is a means of bringing it together; and that is what the British Evangelical Networks project aims to do.

The fundamental building block is what I’m calling a ‘connection’ – a single item of information that connects an individual evangelical minister with a local congregation, or a local or national organisation, at a point in time. Using a simple online form, contributors will be able to enter these connections, one by one or in batches. From time to time, all the connections will be moderated and made available as a dataset online. Scholars can then use the data, ask questions of it, uncover the gaps, and be inspired to fill those gaps. They can then add the new connections they have found, and so the cycle begins again:

Connect – Aggregate – Publish – Use – Connect.
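To make the building block concrete, a single connection might be recorded with fields along the lines of the sketch below. The field names and values are purely illustrative, not the project’s actual schema:

```python
# Purely illustrative: one possible shape for a single BEN 'connection'.
# Every field name and value here is hypothetical.
connection = {
    "minister": "Rev. J. Smith",                  # the individual
    "connected_to": "St Saviour's, Guildford",    # congregation or organisation
    "relationship": "incumbent",                  # e.g. trained at, member of, served
    "start_year": 1962,
    "end_year": 1971,                             # may be unknown or tentative
    "source": "a printed clergy directory",       # where the contributor found it
    "confidence": "tentative",                    # incomplete connections still welcome
}
```

Even a record as sparse as this, submitted tentatively, gives later contributors something to confirm, correct or extend.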

But I don’t suppose it will be easy, because it will require different ways of thinking, both to do with credit and reward, and also about completeness, or tidiness.

Firstly, credit and reward. Those of us who were trained up in the way of the lone scholar tend to be protective of our information, dug from rocky soil at great expense of time and effort. Our currency has been our interpretation, and the authority it bestows. Some while ago I suggested that everyone could benefit from editing Wikipedia and making it better, even if that involved not being obviously credited, and the same applies here. I plan to make available data on the number of connections people contribute, in order that there is something to report to whichever authority needs to know how busy a scholar has been. Those who contribute will also have access to a more fully featured version of the dataset as it is released; those who don’t will be able to read it, but not much more. But it will still be less spectacular than a big book with OUP.

The other issue is about tidiness. Sharon Howard recently encouraged scholars to make more of the data we generate in the course of research available online for others to reuse. But this will involve overcoming a natural wariness of sharing anything “unfinished”. BEN will encourage contributors to submit a connection even if they do not have all the details, since another contributor can’t develop and strengthen a connection that hasn’t been made in the first place, however tentatively. The dataset as a whole is likely to remain incomplete in many places, and tentative in others; but neither of those things make it useless, if it is clear what the state of play is.

For scholars of British evangelicalism, such a resource could transform our understanding of the subject. But we’ll need to live a little dangerously.

Where should the digital humanities live?

Don’t get me wrong. The cluster of work that bears the label ‘digital humanities’ is important; very important. I’ve spent the last decade or so of my working life in the gap between historians and application developers, trying to make sure that digital tools get designed in the ways historians need them to be designed. Projects digitising books; collaborative editing platforms; institutional repositories; Open Access journal platforms; web archives: I’ve done a similar job, more or less well, in each case. As well as that, I was (and remain) founding co-convener of the Digital History seminar at the Institute of Historical Research, which looks to showcase finished historical scholarship that would have been impossible without the digital, broadly defined.

But there is a problem with how we understand the term, I think. I take the term to signify a community of practice, one of scholars employing new technological means to achieve the same ends as they did before ‘the digital’. And as that community of practice grows, one would naturally expect a degree of self-consciousness within it as to the distinctiveness of what we’re all doing. This is inevitable, and almost certainly helpful, as new journals, conferences and online spaces appear in which work can get published that might be too innovative for traditional channels to handle, and in which discussions about method can take place safely.

My worry is over the institutional location of this activity. Several universities have spotted the potential of locating DH people together, and so there are several Schools or Faculties or Departments of Digital Humanities, all centres of real excellence, in universities in the UK and elsewhere. It’s an institutional means of nurturing something important, and it seems to work. My concern is with the long-term.

As in all large organisations, the internal structures of universities have their own force in determining the shape of the work that goes on within them. Structures shape cultures and cultures influence behaviours. It’s nobody’s doing, but the effect is real.

A department has a head, who usually sits at the same table as the head of History, or Philosophy; and funds run down these channels, and reporting lines back up. And my concern is that this Digital Humanities, this enterprise that starts to be treated (in institutional terms) as a discipline in its own right, could become a silo. The unintended consequence of creating a permanent space in which to foster the new approach is that Dr So-and-So in English, or Philosophy, can say “Oh, a digital approach, you say? You want DH – they’re over in the Perkins Building.” Enterprising individuals and projects can and do bridge these gaps between departments; but the effect of the existence of the silo on the general consciousness has to be reckoned with, and mitigating the effect takes time and effort.

Put it this way. When Microsoft Word came within the reach of university budgets, no-one proposed that a Department of Word-Processed Humanities be set up – although word-processing was a technology that became ubiquitous in a short space of time, and had profound and widespread and general effects on a crucial element of academic practice – just like the digital humanities. And right now, there are not Schools of Social Media Humanities to foster communities of practice in the most effective use of Twitter for dissemination and impact. Both these were disruptive technologies which were (and are) promoted across departments, faculties and whole institutions until they needed (or need) promoting no longer.

The end game for a Faculty of DH should be that the use of the tools becomes so integrated within Classics, French and Theology that it can be disbanded, having done its job. DH isn’t a discipline; it’s a cluster of new techniques that give rise to new questions; but they are still questions of History, or Philosophy, or Classics; and it is in those spaces that the integration needs eventually to take place.

Wikipedia, authority and the free rider problem

[This post argues that historians have much to gain from getting involved in making Wikipedia authoritative, in spite of the many disincentives within the current ecology of academic research. However, to make it work, historians would need to embrace a more speculative and more risky model of collaborative work.]

I am a selfish Wikipedian. By which I mean, that while I am very happy to use Wikipedia, I have not been very serious about contributing to it. There are a small handful of pages for which I keep the further reading (reasonably) up to date, and correct if a particularly egregious error appears.  But it is sporadic, and one of the first things to be squeezed out if life gets busy.

And I wonder whether there aren’t real gains for historians in helping Wikipedia become truly authoritative, gains that are obscured by natural disincentives in the way our scholarly ecosystem works.

Firstly, the disincentives. One is a residual wariness of something that can be edited by ‘just anyone’. I myself have dissuaded students from citing Wikipedia as an authority in itself, as part of what I am teaching is the ability to go to the scholarly article that is cited in Wikipedia, and indeed beyond it to the primary source. But my experience is that, in matters of fact, Wikipedia is very reliable unless it concerns a highly charged topic (the significance of Margaret Thatcher, say). And even the making of that judgement is an important part of learning to think critically about what it is we read.

Perhaps more significant is the fact that Wikipedia appears to be edited by no-one in particular. One of the contradictions of modern academic life is that most scholars would, I think, assert the existence of a common good, the pursuit of knowledge, towards which we work in some abstract sense. At the same time, the ways in which we are habituated to achieve that end are fundamentally about competition between scholars for scarce resources: attention, leading to esteem, leading to career advancement.

We write books and articles, which help us get and then keep a job. A smaller but growing number write blogs like this one, and tweet about those blogs. Part of this is about ‘impact’ (that is to say, increasing our share of those scarce quanta of public attention). And all of it depends on being identified as the creator of an item of intellectual property: tweet, blog post, article, book, media interview. Few, even at the wildest edges of the Open Access movement, propose licensing of scholarly outputs without attribution, even if a work may be licensed for the most radical of remixing. All depends on being known.

But Wikipedia doesn’t credit its authors, or at least not in a prominent and easily reportable way. And so the question arises: even though contributing to Wikipedia is to the common good, what is in it for me?

The answer may depend on a more speculative and more risky model of collaborative work, but one which holds out the prospect of a genuinely authoritative resource, made by authorities. And that in turn should reward the best published work, in the good old-fashioned and citable way, by channelling readers to it. (It would be even better for works available Open Access.)

But it depends on everyone jumping together. As long as some contribute, but others only consume, there remains a classic economist’s ‘free rider’ problem. When people use a resource without ‘paying’ (in the form of their own time, and their own particular expertise) then the cost of production is unevenly spread, and the quality of the product diminished. But if editing Wikipedia became a genuinely widespread enterprise amongst scholars, then even if my contribution is not recognised with each and every edit, my ‘main’ work (if it is any good) will be cited and integrated into the fabric of Wikipedia by others. And we might get a more informed public debate about each and every matter, which looks like impact to me. Perhaps I should get more serious about this now.

What use is a personal tweet archive?

A little while ago I wrote a post about the need to plan for archiving the digital “papers” of historians. In that post I talked about research data (what we used to call “notes”); about the systems that form the bridge between that data and the writing process; and about written outputs themselves, and their various iterations. It looked forward to a time when all these digital objects, in multiple formats but from one mind, are available to future students of the way the discipline has developed.

What that post neglected was data about the way I publicise my work. Perhaps one of the reasons we’ve been slow to think about this is that, at one time, most academics didn’t need to. Apart from giving papers at gatherings of the learned, the task of publicising one’s work belonged to the publisher. And if one’s publisher was the right one, then the work would inevitably end up in the hands of the small group of people who needed to know about it. And whilst the media don is not a new phenomenon, most historians might have thought such self-publicity outside the academy something of an embarrassment, even rather vulgar.

How times change. Universities are training their staff in dealing with the traditional media and in the most effective way of using social media. And this opens up a new category of data that ought to be archived, if only to understand how the push for ‘impact’ actually played out in these early years. And some of it is being archived. The Library of Congress are archiving every tweet, although it isn’t yet clear how that archive may be made available for use. The UK Web Archive, along with other national web archives, has been archiving selected blogs (including this one) for several years, and the EU-funded BlogForever project is looking to join those projects up. But this approach, valuable though it is, separates the content from the author, and from the rest of their digital archive. Whilst that link might be retrievable at a higher discovery layer, something important is still lost.

But now the helpful folk at Twitter, in a move that ought to be applauded, have made it very quick and easy to download an archive of one’s own tweets, right back to the beginning. And so I did: 1682 tweets, over 14.5 months. But what to do with it?

Straight away, scrolling through a long CSV file starts to tell the story of the making of other things: the first retweet of someone else’s work which was subsequently to influence my own; the first traces of an idea, or even of a question I was beginning to ask, which spawned a blog post, and then a paper. I also find that I shared at least one link in more than two thirds of my tweets, which sounds public-spirited until I add that a good proportion were my own posts. I can start mining the data for key terms and themes, and how they ebbed and flowed.
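That kind of mining needs very little machinery. As a minimal sketch, assuming the download contains a tweets.csv with ‘timestamp’ and ‘text’ columns (the column names may differ in your own export), one could count hashtags by month like this:

```python
# Count hashtags per month from a downloaded personal tweet archive.
# Assumes tweets.csv has 'timestamp' and 'text' columns; adjust to
# match the column names in your own download.
import csv
import re
from collections import Counter, defaultdict

hashtags_by_month = defaultdict(Counter)

with open("tweets.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        month = row["timestamp"][:7]                  # e.g. '2013-06'
        for tag in re.findall(r"#\w+", row["text"].lower()):
            hashtags_by_month[month][tag] += 1

for month in sorted(hashtags_by_month):
    print(month, hashtags_by_month[month].most_common(5))
```

The same pattern works for any key term, not just hashtags: swap the regular expression for the words or links you want to trace.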

It would be useful if there were a way to keep this data fresh, of course, to avoid going back to Twitter for a new download every so often. And, thanks to @mhawksey, there is a simple way of doing this, using Google Drive. Martin explains all here, with a handy video set-up guide.

And so I now have a cloud-based archive of my tweets, complete with a basic search and browse web interface. This is now a lazy man’s look-up of old tweets and the resources they pointed to, searchable by handle, hashtag or key term.

But perhaps this is something about which most people are lazy. Social media provides us with an overwhelming stream of quite-interesting things, in amongst which are nuggets of gold. Those nuggets I can manage in the old way, by recording them properly, perhaps in a bibliography. I might even read them, one day. But the quite-interesting stuff, whilst being too much ever to record properly, will probably remain quite interesting. And so this provides a middle way between formal curation of a webliography and just searching the live web (which assumes I can remember enough about what I’m looking for.)

Might this archive now change my future tweeting? Early days to judge, perhaps. But I think it may, since I may now retweet and share in preference to using favourites, in order to get a link to a resource into the archive. I can also imagine starting to use personal hashtags, as a way of structuring my own archive at the same time as I tweet. Real-time curation, perhaps?

And I might share it too. Since this is now unambiguously my own data, rather than Twitter’s, I can licence it for reuse by others in larger corpora for analysis. Imagine a pooled archive of the tweets of many historians. Now that would be interesting.

Open Access and open licensing

Much of the recent concern about Open Access in the UK, at least for the humanities, has not been about the general principle, but rather about the means.

In my hearing, however, at least as much consternation was provoked by the prospect of subsequently licensing those outputs for re-use under one or other of the Creative Commons suite of licences. CC allows various degrees of redistribution and re-use, without further recourse to the author, but with credit given. Commercial use can be restricted (or not); the making of derivative works can be provided for (or not). You can Meet the Licenses here.

As an advocate of greater Open Access in the humanities, I suspect that Research Councils UK made a tactical error in suggesting that it intended to enforce the most liberal of these licenses. CC-BY ‘lets others distribute, remix, tweak, and build upon your work, even commercially, as long as they credit you for the original creation.’ Here’s why I think the focus on CC-BY has been a mistake, at this point.

Personally, I have never quite been convinced that ‘full’ or ‘real’ OA was dependent on maximally open licensing. I see free availability of the content for reading and citation as quite distinct from the subsequent reuse of that content in other ways. Both are desirable, but can be decoupled without damage. A move to any form of OA represents a major cultural change, albeit one that is necessary. Given this I would rather see an OA article with all rights reserved (as a staging post) than to not see that article at all. And to couple the two too closely risks the first goal by too strong an insistence on the second. Over time, cultures can and do change; but we ought to practise the art of the possible.

More generally, it isn’t yet clear to me what re-use of a traditional history article looks like. Quotation (with a reference) is a mode historians understand; so is citation as an authority in paraphrase. Both are possible from an article with all rights reserved. Compilation of readers and anthologies would be made easier by CC, but doesn’t require CC-BY. It also isn’t clear what ‘remixing’ of traditional historical writing looks like if it doesn’t involve quotation. Historians are also well used to acknowledging a seminal work in a footnote (or even once only in foreword or acknowledgments) without quoting it directly, but is this all that giving ‘credit’ for ‘remixing’ an idea really means? If so, there is little to fear; but I’m not sure we know, yet.

Over time, there will be possibilities for data-mining in corpora of scholarly articles, but we ought to think further about whether this can be accommodated without full CC-BY. Much turns on the question of what counts as a derivative work in the context of an aggregated database, and what the output to the user is; and whether an insistence on non-commercial re-use shuts down important future possibilities that we can’t yet foresee.

It may be that CC-BY is the right default option; my feeling is that it probably will be. But I think we should probably take more time to document some of these use cases, in order to plan a movement towards licensing for historical writing that is neither more restrictive nor more liberal than it need be, and allows scholars to dip their toes in without plunging in up to the neck. For now, there are horses we should avoid scaring, lest they bolt.