New article: On digital contemporary history

A little article of mine has just appeared in the Danish historical journal Temp, based on a lecture given in Copenhagen to the Danish Association for Research in Contemporary History in January 2016.

It suggests that there has been a relative lack of digitally enabled historical research on the recent past, when compared to earlier periods of history. It explores why this might be the case, focussing in particular on both the obstacles and some missing drivers to mass digitisation of primary sources for the 20th century. It suggests that the situation is likely to change, and relatively soon, as a result of the increasing availability of sources that were born digital, and of Web archives in particular. The article ends with some reflections on several shifts in method and approach, which that changed situation is likely to entail.

By the kind permission of the editor, I make it available here.

Title: Digital contemporary history: sources, tools, methods, issues
Details: Temp: Tidsskrift for historie, 14 (2017), 30-38.
Download the PDF

Welcoming the new Journal of Open Humanities Data

After some months in the making, I am delighted to be able to draw attention to the new Journal of Open Humanities Data. I’m particularly pleased to be a member of the editorial board.

Fully peer-reviewed, JOHD carries “publications describing humanities data or techniques with high potential for reuse.”

The journal accepts two kinds of papers:

“1. Metapapers, that describe humanities research objects with high reuse potential. This might include quantitative and qualitative data, software, algorithms, maps, simulations, ontologies etc. These are short (1000 word) highly structured narratives and must conform to the Metapaper template.

“2. Full length research papers that describe different methods used to create, process, evaluate, or curate humanities research objects. These are intended to be longer narratives (3,000 – 5,000 words) which give authors the ability to describe a research object and its creation in greater detail than a traditional publication.”

For more detail, see the JOHD at Ubiquity Press.

Method in the web archive for the arts and humanities: a conference report

[In early December 2014 the Big UK Domain Data for the Arts and Humanities project held an excellent day conference on the theme of web archives as big data. A good part of the day was taken up with short presentations from the project’s bursary holders, arts and humanities scholars all, reflecting on their substantive research findings, on the experience of using the prototype user interface (developed by the BL) and on web archives as source material in general.
In early 2015 these results will appear on the BUDDAH project blog as a series of reports. This post reflects on some common methodological themes that emerged during the course of the day. A version of this was also posted on the project blog. Details of the projects are to be found also on the BUDDAH blog.]

Perhaps the single most prominent note of the whole day was of the sheer size of the archive. “Too much data!” was a common cry heard during the project, and with good reason, since there are few other archives in common use with data of this magnitude, at least amongst those used by humanists. In an archive with more than 2 billion resources recorded in the index, the researchers found that queries needed to be a great deal more specific than most users are accustomed to; and that even the slightest ambiguity in the choice of search terms in particular led very quickly to result sets containing many thousands of results. Gareth Millward (@MillieQED) also drew attention to the difficulties in interpreting patterns in the incidence over time of any but the most specific search terms across the whole dataset, since almost any search term a user can imagine may have more than one meaning in an archive of the whole UK web.

One common strategy to come to terms with the size of the archive was to “think small”: to explore some very big data by means of a series of small case studies, which could then be articulated together. Harry Raffal, for example, focussed on a succession of captures of a small set of key pages in the Ministry of Defence’s web estate; Helen Taylor on a close reading of the evolution of the content and structure of certain key poetry sites as they changed over time. This approach had much in common with that of Saskia Huc-Hepher on the habitus of the London French community as reflected in a number of key blogs. Rowan Aust also read important things from the presence and absence of content in the BBC’s web estate in the wake of the Jimmy Savile scandal.

An encouraging aspect of the presentations was the methodological holism on display, with this particular dataset being used in conjunction with other web archives, notably the Internet Archive. In the case of Marta Musso’s work on the evolution of the corporate web space, this data was but one part of a broader enquiry employing questionnaire and other evidence in order to create a rounded picture.

One particular and key difference between this prototype interface and other familiar services is that search results in the UI are not prioritised by any algorithmic intervention, but are presented in the archival order. This brought into focus one of the recurrent questions in the project: in the context of superabundant data, how attached is the typical user to a search service that (as it were) second-guesses what it was that the user *really* wanted to ask, and presents results in that order? If such a service is what is required, then how transparent must the operation of the algorithm be in order to be trusted? Richard Deswarte (@CanadianRichard) powerfully drew attention to how fundamental has been the effect of Google on user expectations of the interfaces they use.

Somewhat surprisingly (at least for me), more than one of the speakers was prepared to accept results without such machine prioritisation: indeed, in some senses it was preferable to be able to utilise what Saskia Huc-Hepher described as the “objective power of arbitrariness”. If a query produced more results than could be inspected individually, then both Saskia and Rona Cran were more comfortable with making their own decisions about taking smaller samples from those results than relying on a closed algorithm to make that selection. In a manner strikingly akin to the functionality of the physical library, such arbitrariness also led on occasion to a creative serendipitous juxtaposition of resources: a kind of collage in the web archive.

Book review: The Future of Scholarly Communication (Shorley and Jubb)

[This review appeared in the 24 July issue of Research Fortnight, and is reposted here by kind permission. For subscribers, it is also available here.]

Perhaps the one thing on which all the contributors to this volume could agree is that scholarly communication is changing, and quickly. As such, it is a brave publisher that commits to a collection such as this — in print alone, moreover. Such reflections risk being outdated before the ink dries.

The risk has been particularly acute in the last year, as policy announcements from government, funders, publishers and learned societies have come thick and fast as the implications of the Finch report, published in the summer of 2012, have been worked out. It’s a sign of this book’s lead time that it mentions Finch only twice, and briefly. That said, Michael Jubb, director of the Research Information Network, and Deborah Shorley, Scholarly Communications Adviser at Imperial College London, are to be congratulated for having assembled a collection that, even if it may not hold many surprises, is an excellent introduction to the issues. By and large, the contributions are clear and concise, and Jubb’s introduction is a model of lucidity and balance that would have merited publication in its own right as a summation of the current state of play.

As might be expected, there is much here about Open Access. Following Finch, the momentum towards making all publications stemming from publicly funded research free at the point of use is probably unstoppable. This necessitates a radical reconstruction of business models for publishers, and similarly fundamental change in working practices for scholars, journal editors and research libraries. Here Richard Bennett of Mendeley, the academic social network and reference manager recently acquired by Elsevier, gives the commercial publisher’s point of view, while Mike McGrath gives a journal editor’s perspective that is as pugnacious as Bennett’s is anodyne. Robert Kiley writes on research funders, with particular reference to the Wellcome Trust, where he is head of digital services. Together with Jubb’s introduction and Mark Brown’s contribution on research libraries, these pieces give a clear introduction to hotly contested issues.

There is welcome acknowledgement here that there are different forces at work in different disciplines, with STM being a good deal further on in implementing Open Access than the humanities. That said, all authors concentrate almost exclusively on the journal article, with little attention given to other formats, including the edited collection of essays, the textbook and — particularly crucial for the humanities — the monograph.

Thankfully, there’s more to scholarly communication than Open Access. The older linear process, in which research resulted in a single fixed publication, disseminated to trusted repositories (libraries) that acted as the sole conduits of that work to scholars, is breaking down. Research is increasingly communicated while it is in progress, with users contributing to the data on which research is based at every stage.

Fiona Courage and Jane Harvell provide a case study of the interaction between humanists and social scientists and their data from the long-established Mass Observation Archive. The availability of data in itself is prompting creative thinking about the nature of the published output: here, John Wood writes on how the data on which an article is founded can increasingly be integrated with the text. And the need to manage access to research data is one of several factors prompting a widening of the traditional scope of the research library.

Besides the changing roles of libraries and publishers, social media are allowing scholars themselves to become more active in how their work is communicated. Ellen Collins, also of RIN, explores the use of social media as a means of sharing and finding information about research in progress or when formally published, and indeed as a supplementary or even alternative method of publication, particularly when reaching out to non-traditional audiences.

Collins also argues that so far social media have mimicked existing patterns of communication rather than disrupting them. She’s one of several authors injecting a note of cold realism that balances the technophile utopianism that can creep into collections of this kind. Katie Anders and Liz Elvidge, for example, note that researchers’ incentives to communicate creatively remain weak and indirect in comparison to the brute need to publish or perish. Similarly, David Prosser observes that research communication continues to look rather traditional because the mechanisms by which scholarship is rewarded have not changed, and those imperatives still outweigh the need for communication.

This collection expertly outlines the key areas of flux and uncertainty in scholarly communication. Since many of the issues will only be settled by major interventions by governments and research funders, this volume makes only as many firm predictions as one could expect. However, readers in need of a map to the terrain could do much worse than to start here.

[The Future of Scholarly Communication, edited by Deborah Shorley and Michael Jubb, is published by Facet, at £49.95.]

On historians’ electronic ‘papers’

[This post was written for inclusion in the blog of the Institute of Historical Research’s winter conference on History and Biography]

I should say straight away that I am neither an archivist, nor a specialist in digital preservation (in its strict sense). But I am an historian, and professionally interested in the impact of the digital on our working practices; and during the working day I am on the staff of one of the UK’s main memory institutions. And I’m pleased to have been asked to write this piece by my former colleagues at the IHR, as while there is much going on at present relating to the management of research data, there is much less (that I know of) about the private papers of scholars. What is the infrastructure for preserving these materials, of historians, for historians? Is there even an infrastructure worth the name?

Straight away, there is a problem of definition – of distinguishing between what we might call research data and private papers. In the physical sciences, it is easier to spot the data; lots of numbers in tables, on computers, as opposed to the reams of transcribed or part-transcribed primary sources that I still have from my own Ph.D. And in the physical sciences there has been a much stronger culture of the re-use of data by other scholars. In order to test and refine a hypothesis, it helps to be able to repeat experiments, and for that you need the data. And so that data tends to be ‘cleaner’ – well-defined and structured, with appropriate documentation – and thus easier to share. And so there are services such as Dryad, a discipline-specific data repository designed specifically for this purpose.

Historians have been much less accustomed to this way of working. This is partly because our ‘data’ tends to be angular, asymmetric texts that resist being squashed into anything so restrictive as a table. And there is an attachment among many to the thick description of each source and all its meaning, particular to a time, a place and an individual, and a resistance to abstraction. (To paraphrase J. H. Hexter, the splitters tend to dominate the lumpers.) There are exceptions, but the cliometric urge is not as strong as once it was.

This attachment to the particular is something to be cherished, but I would argue that there is yet more scope for historians to think of their working materials as data, and thus as something that may be shared and re-used. The Old Bailey Proceedings Online is a fine example of a corpus of freely composed texts that has within it a dataset. Not all sources have the degree of regularity of structure that a set of court records has; but there is still much material that languishes on desktop machines that might be set free. But it would require us to think about reuse at the beginning of a project, rather than at the end.

And as well as primary sources that might be shared and reused, there is the question of an historian’s intermediate working materials, that mark the stages by which primary sources are digested and turned into writing. The London Review of Books recently published Keith Thomas’ account of his own working method, thousands of bulging white envelopes full of notes; Christopher Hill was famous for his system of index cards. As evidence of the working practices of a discipline, these paper systems are an artefact to be preserved. As scholars increasingly move to digital systems of managing notes and bibliography, some using proprietary software and some the cloud, we also need to think about how these are best preserved as evidence of how the discipline worked at a particular point in time.

And finally, there is writing. Historians of a certain age will remember a device commonly known as a ‘typewriter’ which impressed characters on a sheet of paper, by a mechanism operated by the pressing of keys. (You can see examples in museums sometimes.) And the use of the typewriter meant that, for every iteration of a piece of writing, there was a physical record. (The typewriter was, as it were, sub-optimally featured for corrections.) The ease of emendation of a word-processed document probably means that these intermediate versions no longer exist. But where they do (and I myself tend to keep numbered versions of articles to reflect each revision), they are a valuable record of the evolution of a piece of writing and the thinking that supports it, and part of intellectual biography.

But who should be preserving these materials? In the past, for the most prominent, an existing connection with an institution tended to lead to their papers being held there: the papers of Noel Annan now reside at King’s College, Cambridge, of which he was Fellow and later Provost; those of E. H. Gombrich at the Warburg Institute at which he spent most of his career. The British Library also receives a certain number of digital archives, but mostly from prominent literary figures, such as the recent deposit from the poet Wendy Cope. But there is a need for a more scaleable solution. Part of this is certainly the recent ventures in services that enable personal digital archiving. But these tend to require a certain level of skill in the issues involved (and for one to be not yet dead) and so there is a place in this new ecology of preservation for organisations, such as the IHR, with an established presence as a repository and clearing-house for a discipline. And as collections of discipline-specific materials grow over time, those collections would become in themselves more than the sum of their parts – part of the stuff of a laboratory for the history of history.

[Picture via matthewtlynch on Flickr, CC BY-NC-SA]