Understanding the shape of the Anglo-Irish web: a pilot project

I’m delighted to be able to say that I shall be a Visiting Research Fellow at the Moore Institute of the National University of Ireland at Galway in 2015. Here are some details of what I plan to get up to.

The task of understanding what constitutes the nation in the web archive is only in its infancy. Web archivists in national libraries have long known that top-level domains such as .uk or .ie do not encompass all the content that should be considered British or Irish for the purposes of analysis. But even the task of understanding the shape of those top-level domains has only just begun. My project begins that process for the Irish web.

One of the live questions about the nature of the national web is the degree to which it interacts with other national domains. This is of particular interest in the Irish context, since many institutions on the island of Ireland interact in cyberspace in ways that do not respect the physical and political border between Northern Ireland and the Republic.

This pilot study will begin to examine this interaction by triangulating analyses of data available from the Internet Archive and from the British Library. In particular, the data from the British Library lists all of the outbound links in the .uk webspace for the period 1996-2010 (see this earlier post). No such dataset exists for the Irish webspace, but by analysing the links from .uk sites to those in the .ie domain, it will be possible to trace the growth and composition of the Irish webspace as reflected in the UK. It will also shed valuable and hitherto unseen light on one aspect of the relationship between the UK and the Republic of Ireland.

The initial outputs will be a series of small case studies, documented on this blog. Over time, these will be synthesised into an appropriate article or articles. I also plan to make subsets of the data available for reuse by other scholars.

The Big UK Domain Data for the Arts and Humanities project has shown an appetite amongst humanities and social sciences scholars to understand the content of web archives, and also to understand the methodological implications of working with what amounts to a new class of primary source. I intend to use the period of the Visiting Fellowship to engage with scholars across the humanities and social sciences at NUI Galway and in other Irish universities, with a view to sowing the seeds of a community of scholars interested in exploring the archive of the Irish webspace.

Interview: doing a PhD in history

Some years ago (in 2010, I think) I gave an interview about the experience of doing a Ph.D., as part of an Institute of Historical Research project on the past and future of the history doctorate. For completeness, I make it available here.
Amongst other things, it reflects how different my Ph.D. would have been had resources such as Early English Books Online been available in 1998. The interviewer is the redoubtable Danny Millum (@ReviewsHistory).

Review: Society and the Internet

Earlier this month I wrote again for the LSE Review of Books. Since the Review is admirably free in the reuse it will allow, I republish it here under a Creative Commons licence.

Society and the Internet: How Networks of Information and Communication are Changing our Lives.
Mark Graham and William H. Dutton (eds.)
Oxford University Press, 2014.

The word ‘revolution’ is at a discount when it comes to discussing the impact of the internet, but current reactions to what is undoubtedly far-reaching and permanent change fit a longer pattern. Societies in the midst of rapid technological change often perceive that change as both radical and unprecedented. Previous shifts in communication technology were greeted in much the same way as the internet now is, understood in terms of both utopia and dystopia. For some, the internet is a new technology in the vanguard of the inexorable progress of such abstract nouns as Freedom and Democracy. It dissolves the power of old elites, putting the power to communicate, publish, mobilise and do business in the hands of any who should want it. For others, it provides dark corners in which criminality may flourish out of reach of traditional law enforcement. It undermines the business models of cherished institutions, saps our powers of concentration, and indeed threatens to alter our very brains in none-too-positive ways.

These two mutually contradictory narratives have one trait in common: a naïve technological determinism. Both stories radically overestimate the degree to which new technologies have inherent dynamics in single and obvious directions, and similarly underestimate the force of the social, economic and political contexts in which real human beings design, implement and use new applications to serve existing needs and desires. It is the great strength of this stimulating collection of essays that at every turn it brings such high-flown imaginings back to the bench of empirical research on the observable behaviours of people and the information systems they use. Given the rapidity of the changes under discussion – the commercialised internet is only now reaching the age of an undergraduate student, as it were, with social media still in junior school – this kind of very contemporary history meets sociology, geography, computer science and many other disciplines in a still fluid interdisciplinary space.

The volume is very much the product of the Oxford Internet Institute, with all but six of the thirty-one contributors being associated with the institute in some way. The twenty-three essays are arranged into five thematic sections: everyday life; information and culture; politics and government; business, industry and economics; and internet regulation and governance. Whilst the grouping is a convenient orientation for the reader, the book is best experienced as a whole, as several themes emerge again and again. In this review I examine just three of many such themes.

One such is the complex geographies of the web. Gillian Bolsover and colleagues examine the shifting geographic centre of gravity of internet use. The proportion of total users located in the United States fell from two thirds to one third in a decade, while the proportion in Asia grew from a tiny 5% to nearly half over the same period. Bolsover and colleagues find that this shift in numbers is accompanied by distinctive geographic variations in the uses to which people put the internet, and in attitudes to its regulation. Reading this chapter in conjunction with that by Mark Graham would suggest that these patterns of use map only loosely onto patterns of knowledge production (the “digital division of labour” between nations). These patterns of production in turn correspond only inexactly to patterns of representation of places online; the “data shadows” fall unevenly. That said, the Global South both produces a small proportion of the content online, and is itself underrepresented as the subject of that content.

Many businesses, and media businesses in particular, have found the last ten years a time of particular uncertainty about the impact of the internet on long-established ways of doing business. Economists will be interested in two chapters which seek to address some of these issues. Sung Wook Ji and David Waterman examine the recent history of media companies in the United States, and point out a steady fall in revenues and a shift from reliance on advertising revenue to direct payment by consumers. Greg Taylor’s valuable essay examines how the advent of almost limitless abundance of content online has ended the traditional economic problem of the scarcity of goods. This has created a different theoretical problem to be understood: the scarcity of the attention that consumers can pay to that content.

Perhaps the most coherent section in the book is that on government and politics. Several governments (mostly amongst those western nations that were the early adopters of the internet) have placed considerable hope on online delivery of government services, and on social media as new means of engagement with voters. At the same time, both the chapters by Margetts, Hale and Yasseri, and by Dubois and Dutton examine the uses made by individuals of electronic means to organise and to influence government independently of, and indeed in opposition to, the agenda of that government. Governments have often expected greater benefits and lower costs from e-government; and political activists have tended to lionise the role of the self-organising ‘Fifth Estate’ of networked individuals to which Dubois and Dutton point. These five chapters situate all these hopes firmly in empirical examination of the interaction of politics, culture and technology in specific contexts.

Individually, the essays in this volume are uniformly strong: lucid, cogent and concise, and accompanied with useful lists of further reading. As a whole, the volume prompts fertile reflections on the method and purpose of the new discipline of Internet Studies. The volume will be of great interest to readers in many disciplines and at all levels from undergraduate upwards.

The ethics of search filtering and big data: who decides?

[Reflecting on discussions at the recent UK Internet Policy Forum, this post argues that societies as moral communities need to take a greater share in the decision-making about controversial issues on the web, such as search filtering and the use of open data. It won’t do to expect tech companies and data collectors to settle questions of ethics.]

Last week I was part of the large and engaged audience at the UK Internet Policy Forum meeting, convened by Nominet. The theme was ‘the Open Internet and the Digital Economy’, and the sessions I attended were on filtering and archiving, and on the uses of Big Data. And the two were bound together by a common underlying theme.

That theme was the relative responsibilities of tech providers, end users and government (and regulators, and legislators) to solve difficult issues of principle: of what should (and should not) be available through search; and which data about persons should truly be regarded as personal, and how they should be used.

On search: last autumn there was a wave of public, and then political, concern about the risk of child pornography being available via search engine results. Something Should Be Done, it was said. But the issue – child pornography – was so emotive, and legally so clear-cut, that important distinctions were not clearly articulated. The production and distribution of images of this kind would clearly be in contravention of the law, even if no-one were ever to view them. And a recurring theme during the day was that these cases were (relatively) straightforward – if someone shows up with a court order, search engines will remove that content from their results, for all users, and the British Library will likewise remove archived versions of that content from the UK Legal Deposit Web Archive.

But there are several classes of other web content about which no court order could be obtained. Content may well directly or indirectly cause harm to those who view it. But because that chain of causation is so dependent on context and so individual, no parliament could legislate in advance to stop the harm occurring, and no algorithm could hope to predict that harm would be caused. I myself am not harmed by a site that provides instructions on how to take one’s own life; but others may well be. There is also another broad category of content which causes no immediate and directly attributable harm, but might in the longer term conduce to a change in behaviour (violent movies, for instance). There is also content which may well cause distress or offence (but not harm); on religious grounds, say. No search provider can be expected to intuit which elements of this content should be removed entirely from search, or suggest to end users as the kind of thing they might not want to see.

These decisions need to be taken at a higher level and in more general terms. Doing so depends on the existence of the kind of moral consensus which was clearly visible at earlier times in British history, but which has been weakened if not entirely destroyed since the ‘permissive’ legislation of the Sixties. The system of theatre censorship was abolished in the UK in 1968 because it had become obvious that there was no public consensus that it was necessary or desirable. A similar story could be told about the decriminalisation of male homosexuality in 1967, or the reform of the law on blasphemy in 2008. As Dave Coplin of Microsoft put it, we need to decide collectively what kind of society we want; once we know that, we can legislate for it, and the technology will follow.

The second session revolved around the issue of big data and privacy. Much can be dealt with by getting the nature of informed consent right, although it is hard to know what ‘informed’ means: it is difficult to imagine in advance all the possible uses to which data might be put, in order both to pose and to answer the question ‘Do you consent?’.

But once again, the issues are wider than this, and it isn’t enough to declare that privacy must come first, as if this settled the issue. As Gilad Rosner suggested, the notion of personal data is not stable over time, or consistent between cultures. The terms of use of each of the world’s web archives are different, because different cultures have privileged different types of data as being ‘private’ or ‘personal’ or ‘sensitive’. Some cultures focus more on data about one’s health, or sexuality, or physical location, or travel, or mobile phone usage, or shopping patterns, or trade union membership, or religious affiliation, or postal address, or voting record and political party membership, or disability. None of these categories is self-evidently more or less sensitive than any of the others, and – again – these are decisions that need to be determined by society at large.

Tech companies and data collectors have responsibilities – to be transparent about the data they do have, and to co-operate quickly with law enforcement. They also must be part of the public conversation about where all these lines should be drawn, because public debate will never spontaneously anticipate all the possible use cases which need to be taken into account. In this we need their help. But ultimately, the decisions about what we do and don’t want must rest with us, collectively.

Introducing Web Archives for Historians

It was a great pleasure last week, after several months, to be able to unveil Web Archives for Historians, a joint project with the excellent Ian Milligan of the University of Waterloo.

The premise is simple. We’re looking to crowd-source a bibliography of research and writing by historians who use or think about the making or use of web archives. Here’s what the site has to say:

“We want to know about works written by historians covering topics such as: (a) reflections on the need for web preservation, and its current state in different countries and globally as a whole; (b) how historians could, should or should not use web archives; (c) examples of actual uses of web archives as primary sources.”

Ian and I had been struck by just how few historians we knew who were beginning to use web archives as primary sources, and how little has been written on the topic. We aimed to provide a resource for historians who are getting interested in the area, helping them to publicise their work and to find that of others.

The bibliography can include formal research articles or book chapters, but also substantial blog posts and conference papers, reflecting the diverse ways in which this type of work is likely to be communicated.

So: please do submit a title, or view the bibliography to date (which is shared on a Creative Commons basis). You can also sign up to express a general interest in the area. These details won’t be shared publicly, but you might just occasionally hear by email of interesting developments as and when we hear of them.

You can also find the project on Twitter at @HistWebArchives.

Reading old news in the web archive, distantly

One of the defining moments of Rowan Williams’ time as archbishop of Canterbury was the public reaction to his lecture in February 2008 on the interaction between English family law and Islamic shari’a law. As well as focussing attention on real and persistent issues of the interaction of secular law and religious practice, it also prompted much comment on the place of the Church of England in public life, the role of the archbishop, and on Williams personally. I tried to record a sample of the discussion in an earlier post.

Of course, a great deal of the media firestorm happened online. I want to take the episode as an example of the types of analysis that the systematic archiving of the web now makes possible: a new kind of what Franco Moretti called ‘distant reading.’

The British Library holds a copy of the holdings of the Internet Archive for the .uk top level domain for the period 1996-2010. One of the secondary datasets that the Library has made available is the Host Link Graph. With this data, it’s possible to begin examining how different parts of the UK web space referred to others. Which hosts linked to which others, and from when until when?

This graph shows the total number of unique hosts that were found linking at least once to archbishopofcanterbury.org in each year.

[Figure: unique hosts linking to archbishopofcanterbury.org each year, bar chart]

My hypothesis was that there should be more unique hosts linking to the archbishop’s site after February 2008, which is by and large borne out. The figure for 2008 is nearly 50% higher than for the previous year, and nearly 25% higher than the previous peak in 2004. This would suggest that a significant number of hosts that had not previously linked to the Canterbury site did so in 2008, quite possibly in reaction to the shari’a story.

What I had not expected to see was the total number fall back to trend in 2009 and 2010. I had rather expected to see the absolute numbers rise in 2008 and then stay at similar levels – that is, to see the links persist. The drop suggests that either large numbers of sites were revised to remove links that were thought to be ‘ephemeral’ (that is to say, actively removed), or that there is a more general effect in that certain types of “news” content are not (in web archivist terms) self-archiving. [Update 02/07/2014 – see comment below ]

The next step is for me to look in detail at those domains that linked only once to Canterbury, in 2008, and to examine these questions in a more qualitative way. Here then is distant reading leading to close reading.

You can download the data, which is in the public domain, from here. Be sure to have plenty of hard disk space: unzipped, the data is more than 120GB. The data looks like this:

2010 | churchtimes.co.uk | archbishopofcanterbury.org | 20

which tells you that in 2010, the Internet Archive captured 20 individual resources (usually, although not always, “pages”) in the Church Times site that linked to the archbishop’s site. My poor old laptop spent a whole night running through the dataset and extracting all the instances of the string “archbishopofcanterbury.org”.
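That overnight scan can be sketched in a few lines. This is a hypothetical reconstruction rather than the script I actually ran; it assumes only the pipe-delimited layout shown in the sample above (year | source host | target host | count), and streams the file line by line, since 120GB will not fit in memory.

```python
# Stream the host link graph and keep only rows whose target host is
# the archbishop's site. The file is far too large to load whole, so
# we read it lazily, one line at a time.
import io

TARGET = "archbishopofcanterbury.org"

def extract_links(lines, target=TARGET):
    """Yield (year, source_host, target_host, count) for rows linking to target."""
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue  # skip any malformed rows
        year, source, host, count = parts
        if host == target:
            yield int(year), source, host, int(count)

# A tiny in-memory stand-in for the real 120GB file:
sample = io.StringIO(
    "2010 | churchtimes.co.uk | archbishopofcanterbury.org | 20\n"
    "2010 | example.co.uk | bbc.co.uk | 3\n"
)
rows = list(extract_links(sample))
# rows == [(2010, 'churchtimes.co.uk', 'archbishopofcanterbury.org', 20)]
```

On the real file one would pass `open(path)` instead of the `StringIO` sample; the generator keeps memory use constant however large the input.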

Then I looked at the total numbers of unique hosts linking to the archbishop’s site in each year. In order to do so, I:

(i) stripped out those results which were outward links from a small number of captures of the archbishop’s site itself.

(ii) allowed for the occasions when the IA had captured the same host twice in a single year (which does not occur consistently from year to year.)

(iii) did not aggregate results for hosts that were part of a larger domain. This would have been easy to spot in the case of the larger media organisations such as the Guardian, which has multiple hosts (society.guardian.co.uk, education.guardian.co.uk, etc.) However, it is much harder to do reliably for all such cases without examining individual archived instances, which was not possible at this scale.


In doing so, I also assumed:
(i) that a host “abc.co.uk” held the same content as “www.abc.co.uk”.

(ii) that the Internet Archive was no more likely to miss hosts that linked to the Canterbury site than ones that did not – i.e., if there are gaps in what the Internet Archive found, there is no reason to suppose that they systematically skew this particular analysis.
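The counting steps and assumptions above can be sketched as follows. Again this is an illustrative reconstruction, not my actual code; the function names and the tuple layout are assumptions:

```python
# Count unique hosts linking to the archbishop's site in each year,
# applying the steps and assumptions described above.
from collections import defaultdict

SITE = "archbishopofcanterbury.org"

def normalise(host):
    """Assumption (i): 'www.abc.co.uk' and 'abc.co.uk' are the same host."""
    return host[4:] if host.startswith("www.") else host

def unique_linking_hosts(rows, target=SITE):
    """rows: iterable of (year, source_host, target_host, count) tuples.

    Step (i): outward links from the target site itself are excluded.
    Step (ii): a set deduplicates hosts captured more than once in a year.
    Step (iii): hosts are NOT merged into parent domains, so
    society.guardian.co.uk and education.guardian.co.uk count separately.
    """
    per_year = defaultdict(set)
    for year, source, host, _count in rows:
        if host != target or normalise(source) == normalise(target):
            continue
        per_year[year].add(normalise(source))
    return {year: len(hosts) for year, hosts in sorted(per_year.items())}

rows = [
    (2008, "www.churchtimes.co.uk", SITE, 5),
    (2008, "churchtimes.co.uk", SITE, 2),           # same host, deduplicated
    (2008, "archbishopofcanterbury.org", SITE, 1),  # self-link, excluded
    (2009, "bbc.co.uk", SITE, 7),
]
counts = unique_linking_hosts(rows)
# counts == {2008: 1, 2009: 1}
```

Note that the link counts themselves are discarded: for this analysis only the presence of at least one link from a host in a given year matters.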

Book review: The Future of Scholarly Communication (Shorley and Jubb)

[This review appeared in the 24 July issue of Research Fortnight, and is reposted here by kind permission. For subscribers, it is also available here.]

Perhaps the one thing on which all the contributors to this volume could agree is that scholarly communication is changing, and quickly. As such, it is a brave publisher that commits to a collection such as this — in print alone, moreover. Such reflections risk being outdated before the ink dries.

The risk has been particularly acute in the last year, as policy announcements from government, funders, publishers and learned societies have come thick and fast as the implications of the Finch report, published in the summer of 2012, have been worked out. It’s a sign of this book’s lead time that it mentions Finch only twice, and briefly. That said, Michael Jubb, director of the Research Information Network, and Deborah Shorley, Scholarly Communications Adviser at Imperial College London, are to be congratulated for having assembled a collection that, even if it may not hold many surprises, is an excellent introduction to the issues. By and large, the contributions are clear and concise, and Jubb’s introduction is a model of lucidity and balance that would have merited publication in its own right as a summation of the current state of play.

As might be expected, there is much here about Open Access. Following Finch, the momentum towards making all publications stemming from publicly funded research free at the point of use is probably unstoppable. This necessitates a radical reconstruction of business models for publishers, and similarly fundamental change in working practices for scholars, journal editors and research libraries. Here Richard Bennett of Mendeley, the academic social network and reference manager recently acquired by Elsevier, gives the commercial publisher’s point of view, while Mike McGrath gives a journal editor’s perspective that is as pugnacious as Bennett’s is anodyne. Robert Kiley writes on research funders, with particular reference to the Wellcome Trust, where he is head of digital services. Together with Jubb’s introduction and Mark Brown’s contribution on research libraries these pieces give a clear introduction to hotly contested issues.

There is welcome acknowledgement here that there are different forces at work in different disciplines, with STM being a good deal further on in implementing Open Access than the humanities. That said, all authors concentrate almost exclusively on the journal article, with little attention given to other formats, including the edited collection of essays, the textbook and — particularly crucial for the humanities — the monograph.

Thankfully, there’s more to scholarly communication than Open Access. The older linear process, in which research resulted in a single fixed publication disseminated to libraries, the trusted repositories that acted as the sole conduits of that work to scholars, is breaking down. Research is increasingly communicated while it is in progress, with users contributing to the data on which research is based at every stage.

Fiona Courage and Jane Harvell provide a case study of the interaction between humanists and social scientists and their data from the long-established Mass Observation Archive. The availability of data in itself is prompting creative thinking about the nature of the published output: here, John Wood writes on how the data on which an article is founded can increasingly be integrated with the text. And the need to manage access to research data is one of several factors prompting a widening of the traditional scope of the research library.

Besides the changing roles of libraries and publishers, social media is allowing scholars themselves to become more active in how their work is communicated. Ellen Collins, also of RIN, explores the use of social media as means of sharing and finding information about research in progress or when formally published, and indeed as a supplementary or even alternative method of publication, particularly when reaching out to non-traditional audiences.

Collins also argues that so far social media have mimicked existing patterns of communication rather than disrupting them. She’s one of several authors injecting a note of cold realism that balances the technophile utopianism that can creep into collections of this kind. Katie Anders and Liz Elvidge, for example, note that researchers’ incentives to communicate creatively remain weak and indirect in comparison to the brute need to publish or perish. Similarly, David Prosser observes that research communication continues to look rather traditional because the mechanisms by which scholarship is rewarded have not changed, and those imperatives still outweigh the need for communication.

This collection expertly outlines the key areas of flux and uncertainty in scholarly communication. Since many of the issues will only be settled by major interventions by governments and research funders, this volume makes only as many firm predictions as one could expect. However, readers in need of a map to the terrain could do much worse than to start here.

[The Future of Scholarly Communication, edited by Deborah Shorley and Michael Jubb, is published by Facet, at £49.95.]