Why hoping private companies will just do the Right Thing doesn’t work

In the last few weeks I’ve been to several conferences on the issue of the preservation of online content for research, and in particular social media. This is an issue that is attracting a lot of attention at the moment: for examples, see Helen Hockx-Yu’s paper for last year’s IFLA conference, or the forthcoming TechWatch report from the Digital Preservation Coalition. As I myself blogged a little while ago, and (obliquely) suggested in this presentation on religion and social media, there’s growing interest from social scientists in using social media data – most typically Twitter or Facebook – to understand contemporary social phenomena. But whereas users of the archived web (such as myself) can rely on continued access to the data we use, and can expect to be able to point to that data such that others may follow and replicate our results, this isn’t the case with social media.

Commercial providers of social media platforms impose several different kinds of barriers. These can include: limits on the amount of data that may be requested in any one period of time; provision of samples of data created by proprietary algorithms which may not themselves be scrutinised; and limits on how much of a dataset, and/or which fields within it, may be shared with other researchers. These issues are well known, and aren't my main concern here. My concern is with how these restrictions are being discussed by scholars, librarians and archivists.

I’ve noticed an inability to imagine why it is that these restrictions are made, and as a result a struggle to begin to think what the solutions might be. There has been a similar trend in the Open Access community: to paint commercial academic publishers as profit-hungry dinosaurs, making money without regard to the public-good element of scholarly publishing. Regarding social media, it is viewed as simply a failure of good manners when a social media firm shuts down a service without providing for scholarly access to its archive, or does not allow scholars free access to and reuse of its data. Why (the question is implicitly posed) don’t these organisations do the Right Thing? Surely everyone thinks that preserving this stuff is worthwhile, and that it is a duty of all providers?

But private corporations aren’t individuals, endowed with an idea of duty and a moral sense. Private corporations are legal abstractions: machines designed for the maximisation of return on capital. If they don’t do the Right Thing, it isn’t because the people who run them are bad people. No; it’s because the thing we want them to do (or not do) impacts adversely on revenue, or adds extra cost without corresponding additional revenue.

Fundamentally, a commercial organisation is likely to shut down an unprofitable service without regard to the archive unless (i) providing access to the archive is likely to yield research findings which will help future service development, or; (ii) it causes positive harm to the brand to shut it down (or helps the brand to be seen *not* to do so.) Similarly, they are unlikely to incur costs to run additional services for researchers, or to share valuable data unless (again) they stand to gain something from the research, however obliquely, or by doing so they either help or protect the brand.

At this point, readers may despair of getting anywhere in this regard, which I can understand. One way through might be an enlargement of the scope of legal deposit legislation, such that some categories of data (politicians’ tweets, say, given the recent episode over Politwoops) are deemed sufficiently significant to be treated as public records. There will surely be lobbying against it, but once such a law is passed, companies will adapt their business models to the changed circumstances, as they always have done. An even harder task is to shift the terms of public discourse such that a publicly accessible record of this data is seen by the public as necessary. Another way is to build communities of researchers around particular services, such that generalisable research about a service can be absorbed by its provider, thus showing that openness with the data leads to a gain in terms of research and development.

All of these are in their ways Herculean tasks, and I have no blueprint for them. But recognising the commercial realities of the situation would get us further than vague pieties about persuading private firms to do the Right Thing. It isn’t how they work.


Will historians of the future be able to study Twitter?

Over the last year or so, the IHR Digital History seminar has become increasingly web-focussed, which is of course of interest to me (if not necessarily to everyone). Last week we had an excellent paper from Jack Grieve of Aston University on tracking newly emerging words as they appear in large corpora of tweets from the UK and the US. By amassing very large tweet datasets, he and his colleagues are able to observe the early traces of newly emerging words, and also (when those tweets were submitted from devices which attach geo-references) to see where those new words first appear, and how they spread. Jack and his colleagues are finding that words quite often emerge first (in the US) in the east and south-east (or California) and then spread towards the centre of the continent. They don’t necessarily spread in even waves across space, nor do they simply jump between urban centres and then out to rural areas (as would have been my uneducated guess). Read more at the project site, treets.net, or watch the paper.

This kind of approach is quite impossible without the very large-scale natural-language data that social media afford. This is particularly so because most words are (perhaps counter-intuitively) rather rare. In the corpus in question, the majority of the 67,000 most common words appear only once in 25 million words. Given this, datasets of billions of tweets are the minimum size necessary to be able to see the patterns.
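To see why such scale is needed, here is a toy sketch (not the project’s actual code) of the kind of type-frequency counting involved. Even in a tiny sample, most word types turn out to be hapaxes, i.e. words occurring exactly once:

```python
from collections import Counter

def frequency_profile(tokens):
    """Count word frequencies and report how many types are hapaxes
    (words occurring exactly once) -- a small illustration of why very
    large corpora are needed before rare words become observable."""
    counts = Counter(tokens)
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return counts, hapaxes

tokens = "the cat sat on the mat while the dog slept".split()
counts, hapaxes = frequency_profile(tokens)
print(counts["the"])  # 3
print(hapaxes)        # 7: every word type except "the" occurs once
```

Scaled up, the same long-tailed distribution means that tracking a genuinely new word, which by definition starts as a hapax, requires corpora of billions of tokens.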

It was interesting to me as a convenor to see the rather different spread of people who came to this paper, as opposed to the more usual digital history work the seminar showcases. Jack focussed on tweets posted since 2013, a time span that even the most contemporary historian would struggle to call their own, and so not so many historians came along – but we had perhaps our first mathematician instead. Their absence was a shame, as Jack’s paper was a fascinating glimpse into the way that historical linguistics, and indeed other types of historical enquiry, might look in a couple of decades’ time.

But there is a caveat to this, which was beyond the scope of Jack’s paper, to do with the means by which this data will be accessible to scholars of 2014 working in (say) 2044. Jack and his colleagues work directly from the so-called Twitter “firehose”; they harvest every tweet coming from the Twitter API, and (on their own hardware) process each tweet and discard those that are not geo-coded to within the study area. This kind of work involves considerable local computing firepower, and (more importantly) is concerned with the now. It creates data in real time to answer questions of the very recent past.
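The filtering step described above might be sketched roughly as follows. This is a hypothetical illustration, not the project’s code: the record layout and field names (`coordinates`, `text`) are my own assumptions, standing in for whatever the Twitter API actually delivers, and the bounding box is only an approximation of the UK:

```python
# Approximate UK bounding box: (min lon, min lat, max lon, max lat).
UK_BBOX = (-8.65, 49.84, 1.77, 60.85)

def in_study_area(tweet, bbox=UK_BBOX):
    """Keep only tweets geo-coded to within the study area;
    tweets with no geo-reference are discarded."""
    coords = tweet.get("coordinates")
    if coords is None:
        return False
    lon, lat = coords
    return bbox[0] <= lon <= bbox[2] and bbox[1] <= lat <= bbox[3]

# A stand-in for the incoming stream of tweets.
stream = [
    {"text": "amazeballs", "coordinates": (-0.13, 51.51)},   # London
    {"text": "no geotag",  "coordinates": None},
    {"text": "howdy",      "coordinates": (-97.74, 30.27)},  # Austin, TX
]

kept = [t for t in stream if in_study_area(t)]
print(len(kept))  # 1: only the London tweet survives
```

The point is that each tweet is inspected once, in real time, and anything outside the study area is thrown away for good – which is precisely why a researcher with different questions cannot simply re-run the filter later.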

Researchers working in 2044 and interested in 2014 may well be able to re-use this particular bespoke dataset (assuming it is preserved – a different matter of research data management, for another post sometime). However, they may equally well want to ask completely different questions, and so need data prepared in a quite different way. Right now, the future of the vast ocean of past tweets is not certain; and so it is not clear whether the scholar of 2044 will be able to create their own bespoke subset of data from the archive. The Library of Congress, to be sure, are receiving an archive of data from Twitter; but the access arrangements for this data are not clear, and (at present) are zero. So, in the same way that historians need to take some ownership of the future of the archived web, we need to become much more concerned about the future of social media: the primary sources that our graduate students, and their graduate students in turn, will need to work with two generations down the line.

Certainly, historians have always been used to working around and across the gaps in the historical record; it’s part of the basic skillset, to deal with the fragmentary survival of the record. But there is right now a moment in which major strategic decisions are to be made about that survival, and historians need to make themselves heard.

[This post also appears on the IHR Digital History Seminar blog.]

Religion, social media and the web archive

Late last year I was delighted to be invited to be one of four keynote speakers at a workshop on religion and social media at the International AAAI Conference on Web and Social Media in Oxford in May. Here are some initial thoughts on what I intend to say.

There has been an interesting upswing recently in scholarly interest in the ways in which religious people, and the organisations in which they gather together, represent themselves and communicate with others on social media. However, this work has been conducted relatively independently from the emerging body of scholarship on the archived web.

There are some reasons for this. First is the fact that much of the scholarship on social media tends to be focussed very firmly on the present. As such, data tends to be gathered directly from social media platforms “to order”, to match the particular research questions in view, and does not engage with the various web archives that are in existence, whether at national libraries or the Internet Archive.

The second reason (which may indeed be the more important) is that traditional web archiving has limited success in archiving social media content. There are several well-documented reasons for this, not least the significant technical difficulties in capturing the content as it is presented in user interfaces such as that for Twitter or YouTube. Also, the data gathered is wrapped up in its presentation layer, rather than being neatly organised as a dataset for analysis. Aside from these technical challenges, the very social nature of social media – with multiple content creators co-existing and interacting on the same platform – adds considerable complexity to the task of the web archivist of determining which content can be archived under existing legal deposit frameworks.

So much for the reasons; but this gap between social media research and the archived web needs to be closed, because part of the story is missed. If we want to understand the evolution of the engagement of churches with social media, then we need to understand the ways in which traditional church websites integrated social media content within themselves, and from what point in time. As well as this, we need to be able to understand the content to which social media users were referring and linking – content which will increasingly often be found only in web archives as it disappears from the live web.

In Oxford, I shall be presenting some small case studies in the development of the web and social media presence of local churches, individuals and national church bodies in England and in Ireland. How quickly did churches begin to integrate their social media channels with their websites – which is to ask, at what point did social media become central to their communication strategies? This is enabled by data made available by the British Library which covers the period from 1996 until 2013: the period in which social media grew from nothing to the prominence it now holds.

[Updated, 5 June 2015: here are the slides.]

Book review: The Future of Scholarly Communication (Shorley and Jubb)

[This review appeared in the 24 July issue of Research Fortnight, and is reposted here by kind permission. For subscribers, it is also available here.]

Perhaps the one thing on which all the contributors to this volume could agree is that scholarly communication is changing, and quickly. As such, it is a brave publisher that commits to a collection such as this — in print alone, moreover. Such reflections risk being outdated before the ink dries.

The risk has been particularly acute in the last year, as policy announcements from government, funders, publishers and learned societies have come thick and fast as the implications of the Finch report, published in the summer of 2012, have been worked out. It’s a sign of this book’s lead time that it mentions Finch only twice, and briefly. That said, Michael Jubb, director of the Research Information Network, and Deborah Shorley, Scholarly Communications Adviser at Imperial College London, are to be congratulated for having assembled a collection that, even if it may not hold many surprises, is an excellent introduction to the issues. By and large, the contributions are clear and concise, and Jubb’s introduction is a model of lucidity and balance that would have merited publication in its own right as a summation of the current state of play.

As might be expected, there is much here about Open Access. Following Finch, the momentum towards making all publications stemming from publicly funded research free at the point of use is probably unstoppable. This necessitates a radical reconstruction of business models for publishers, and similarly fundamental change in working practices for scholars, journal editors and research libraries. Here Richard Bennett of Mendeley, the academic social network and reference manager recently acquired by Elsevier, gives the commercial publisher’s point of view, while Mike McGrath gives a journal editor’s perspective that is as pugnacious as Bennett’s is anodyne. Robert Kiley writes on research funders, with particular reference to the Wellcome Trust, where he is head of digital services. Together with Jubb’s introduction and Mark Brown’s contribution on research libraries, these pieces give a clear introduction to hotly contested issues.

There is welcome acknowledgement here that there are different forces at work in different disciplines, with STM being a good deal further on in implementing Open Access than the humanities. That said, all authors concentrate almost exclusively on the journal article, with little attention given to other formats, including the edited collection of essays, the textbook and — particularly crucial for the humanities — the monograph.

Thankfully, there’s more to scholarly communication than Open Access. The older linear process – in which research resulted in a single fixed publication, disseminated to trusted repositories (libraries) that acted as the sole conduits of that work to scholars – is breaking down. Research is increasingly communicated while it is in progress, with users contributing to the data on which research is based at every stage.

Fiona Courage and Jane Harvell provide a case study of the interaction between humanists and social scientists and their data from the long-established Mass Observation Archive. The availability of data in itself is prompting creative thinking about the nature of the published output: here, John Wood writes on how the data on which an article is founded can increasingly be integrated with the text. And the need to manage access to research data is one of several factors prompting a widening of the traditional scope of the research library.

Besides the changing roles of libraries and publishers, social media are allowing scholars themselves to become more active in how their work is communicated. Ellen Collins, also of RIN, explores the use of social media as a means of sharing and finding information about research in progress or when formally published, and indeed as a supplementary or even alternative method of publication, particularly when reaching out to non-traditional audiences.

Collins also argues that so far social media have mimicked existing patterns of communication rather than disrupting them. She’s one of several authors injecting a note of cold realism that balances the technophile utopianism that can creep into collections of this kind. Katie Anders and Liz Elvidge, for example, note that researchers’ incentives to communicate creatively remain weak and indirect in comparison to the brute need to publish or perish. Similarly, David Prosser observes that research communication continues to look rather traditional because the mechanisms by which scholarship is rewarded have not changed, and those imperatives still outweigh the need for communication.

This collection expertly outlines the key areas of flux and uncertainty in scholarly communication. Since many of the issues will only be settled by major interventions by governments and research funders, this volume makes only as many firm predictions as one could expect. However, readers in need of a map to the terrain could do much worse than to start here.

[The Future of Scholarly Communication, edited by Deborah Shorley and Michael Jubb, is published by Facet, at £49.95.]