Will historians of the future be able to study Twitter?

Over the last year or so, the IHR Digital History seminar has become increasingly web-focussed, which is of course of interest to me (if not necessarily to everyone.) Last week we had an excellent paper from Jack Grieve of Aston University on the tracking of newly emerging words as they appeared in large corpora of tweets from the UK and the US. By amassing very large tweet datasets, he and his colleagues are able to observe the early traces of newly emerging words, and also (when those tweets were submitted from devices which attach geo-references) to see where those new words first appear, and how they spread. Jack and his colleagues are finding that words quite often emerge first (in the US) in the east and south-east (or California) and then spread towards the centre of the continent. They don’t necessarily spread in even waves across space, or even spring between urban centres and then to rural areas (as would have been my uneducated guess). Read more at the project site, treets.net, or watch the paper.

This kind of approach is quite impossible without the kind of very large-scale natural language data such as social media afford. This is particularly so as most words are (perhaps counter-intuitively) rather rare. In the corpus in question, the majority of the 67,000 most common words appear only once in 25 million words. Given this, datasets of billions of tweets are the minimum size necessary to be able to see the patterns.

It was interesting to me as a convenor to see the rather different spread of people who came to this paper, as opposed to the more usual digital history work the seminar showcases. Jack focussed on tweets posted since 2013; a time span that even the most contemporary historian would struggle to call their own; and so not so many of them came along – but we had perhaps our first mathematician instead. This was a shame, as Jack’s paper was a fascinating glimpse into the way that historical linguistics, and indeed other types of historical enquiry, might look in a couple of decades’ time.

But there is a caveat to this, which was beyond the scope of Jack’s paper, to do with the means by which this data will be accessible to scholars of 2014 working in (say) 2044. Jack and his colleagues work directly from the so-called Twitter “firehose”; they harvest every tweet coming from the Twitter API, and (on their own hardware) process each tweet and discard those that are not geo-coded to within the study area. This kind of work involves considerable local computing firepower, and (more importantly) is concerned with the now. It creates data in real time to answer questions of the very recent past.

Researchers working in 2044 and interested in 2014 may well be able to re-use this particular bespoke dataset (assuming it is preserved – a different matter of research data management, for another post sometime). However, they may equally well want to ask completely different questions, and so need data prepared in a quite different way. Right now, the future of the vast ocean of past tweets is not certain; and so it is not clear whether the scholar of 2044 will be able to create their own bespoke subset of data from the archive. The Library of Congress, to be sure, are receiving an archive of data from Twitter; but the access arrangements for this data are not clear, and (at present) are zero. So, in the same way that historians need to take some ownership of the future of the archived web, we need to become much more concerned about the future of social media: the primary sources that our graduate students, and their graduate students in turn, will need to work with two generations down the line.

Certainly, historians have always been used to working around and across the gaps in the historical record; it’s part of the basic skillset, to deal with the fragmentary survival of the record. But there is right now a moment in which major strategic decisions are to be made about that survival, and historians need to make themselves heard.

[This post also appears on the IHR Digital History Seminar blog.]

What use is a personal tweet archive ?

A little while ago I wrote a post about the need to plan for archiving the digital “papers” of historians. In that post I talked about research data (what we used to called “notes”); about the systems that form the bridge between that data and the writing process; and about written outputs themselves, and their various iterations. It looked forward to a time when all these digital objects, in multiple formats but from one mind, are available to future students of the way the discipline has developed.

What that post neglected was data about the way I publicise my work. Perhaps one of the reasons we’ve been slow to think about this is that, at one time, most academics didn’t need to. Apart from giving papers at gatherings of the learned, the task of publicising one’s work belonged to the publisher. And if one’s publisher was the right one, then the work would inevitably end up in the hands of the small group of people who needed to know about it. And whilst the media don is not a new phenomenon, most historians might have thought such self-publicity outside the academy something of an embarrassment, even rather vulgar.

How times change. Universities are training their staff in dealing with the traditional media and in the most effective way of using social media. And this opens up a new category of data that ought to be archived, if only to understand how the push for ‘impact’ actually played out in these early years. And some of it is being archived. The Library of Congress are archiving every tweet, although it isn’t yet clear how that archive may be made available for use. The UK Web Archive, along with other national web archives, have been archiving selected blogs (including this one) for several years, and the EU-funded BlogForever project is looking to join those projects up. But this approach, valuable though it is, separates the content from the author, and from the rest of their digital archive. Whilst that link might be retrievable at a higher discovery layer, something important is still lost.

But now the helpful folk at Twitter, in a move that ought to be applauded, have made it very quick and easy to download an archive of one’s own tweets, right back to the beginning. And so I did: 1682 tweets, over 14.5 months. But what to do with it ?

Straight away, scrolling through a long CSV file starts to tell the story of the making of other things: the first retweet of someone else’s work which was subsequently to influence my own; the first traces of an idea, or even of a question I was beginning to ask, which spawned a blog post, and then a paper. I also find that I shared at least one link in more than two thirds of my tweets, which sounds public-spirited until I add that a good proportion were my own posts. I can start mining the data for key terms and themes, and how they ebbed and flowed.

It would be useful if there was a way to keep this data fresh, of course, to avoid going back to Twitter for a new download every so often. And, thanks to @mhawksey, there is a simple way of doing this, using Google Drive. Martin explains all here, with a handy video set-up guide.tweet archive

And so I now have a cloud-based archive of my tweets, complete with a basic search and browse web interface. This is now a lazy man’s look-up of old tweets and the resources they pointed to, searchable by handle, hashtag or key term.

But perhaps this is something about which most people are lazy. Social media provides us with an overwhelming stream of quite-interesting things, in amongst which are nuggets of gold. Those nuggets I can manage in the old way, by recording them properly, perhaps in a bibliography. I might even read them, one day. But the quite-interesting stuff, whilst being too much ever to record properly, will probably remain quite interesting. And so this provides a middle way between formal curation of a webliography and just searching the live web (which assumes I can remember enough about what I’m looking for.)

Might this archive now change my future tweeting ? Early days to judge perhaps. But I think it may, since I may now retweet and share in preference to using favourites, in order to get a link to a resource into the archive. I can also imagine starting to use personal hashtags, as a way of structuring my own archive at the same time as I tweet. Real-time curation perhaps ?

And I might share it too. Since this is now unambiguously my own data, rather than Twitter’s, I can licence it for reuse by others in larger corpora for analysis. Imagine a pooled archive of the tweets of many historians. Now that would be interesting.