British blogs in the web archive: some data

While working on another project, I had occasion to compile some data relating to the blog aggregator site britishblogs.co.uk (now apparently defunct), as it appears in the Internet Archive between 2006 and 2012. I am unlikely to exploit it much myself, so I have made it available on figshare in case it is of use to anyone else.

Specifically, it is data derived from the UK Host Link Graph, which records the presence of links from one host to another in the JISC UK Web Domain Dataset (1996-2010), a dataset of archived web content for the UK country code top-level domain captured by the Internet Archive.

The data has 19,423 lines, each expressing one host-to-host linkage in content from a single year.

Since the blog as a format seems to be particularly prone to disappearance over time, scholars of the British blogosphere may find this useful in locating now-defunct blogs in the Internet Archive or elsewhere. My sense is that the blogs included in the aggregator were nominated by their authors as being British, and so this may be of some help in identifying British content on platforms such as WordPress or Blogger.

Some words of caution. The data is offered very much as-is, without any guarantees about robustness or significance. In particular:

(i) I have made no effort to de-duplicate where the Internet Archive crawled the site, or parts of it, more than once in a single year.

(ii) Also present are a certain number of inbound links – that is to say, other hosts linking to britishblogs.co.uk. However, these are very much the minority.

(iii) Some analysis is also needed to distinguish links to blogs from links to content that those blogs linked to (and which British Blogs aggregated).
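As a rough illustration of how the first two cautions might be handled, here is a minimal Python sketch. It assumes each line is laid out as year|source host|target host|count, which is my recollection of the Host Link Graph format and should be checked against the actual file. The sample rows and host names other than britishblogs.co.uk are invented for the example.

```python
from collections import defaultdict

AGGREGATOR = "britishblogs.co.uk"  # the host discussed in this post


def parse_line(line):
    """Split one line, ASSUMING the layout 'year|source|target|count'."""
    year, source, target, count = line.strip().split("|")
    return int(year), source, target, int(count)


def summarise(lines):
    """Collapse repeated rows and separate outbound from inbound links."""
    outbound = defaultdict(set)  # year -> hosts the aggregator links to
    inbound = defaultdict(set)   # year -> hosts linking to the aggregator
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        year, source, target, _count = parse_line(line)
        if source == AGGREGATOR:
            outbound[year].add(target)
        elif target == AGGREGATOR:
            inbound[year].add(source)
    return outbound, inbound


# Invented sample rows, for illustration only.
sample = [
    "2008|britishblogs.co.uk|example-blog.co.uk|3",
    "2008|britishblogs.co.uk|example-blog.co.uk|1",  # repeat within one year
    "2008|another-site.co.uk|britishblogs.co.uk|2",  # inbound link
    "2009|britishblogs.co.uk|example-blog.co.uk|5",
]
outbound, inbound = summarise(sample)
print(sorted(outbound[2008]))
print(sorted(inbound[2008]))
```

Using a set per year means a host counts once however many times it was crawled in that year. The exact-match test on the host name would miss variants such as a www. prefix, so real use would probably need some normalisation first.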

 

Web archives: a new class of primary source for historians?

On June 11th I gave a short paper at the Digital History seminar at the Institute of Historical Research, looking at the implications of web archives for historical practice, and introducing some of the work I’ve been doing (at the British Library) with the JISC-funded Analytical Access to the Domain Dark Archive project. It picked up on themes in a previous post here.

There is also an audio version here at HistorySpot along with the second paper in the session, given by Richard Deswarte.

The abstract (for the two papers together) reads:

“When viewed in historical context, the speed at which the world wide web has become fundamental to the exchange of information is perhaps unprecedented. The Internet Archive began its work in archiving the web in 1996, and since then national libraries and other memory institutions have followed suit in archiving the web along national or thematic lines. However, whilst scholars of the web as a system have been quick to embrace archived web materials as the stuff of their scholarship, historians have been slower in thinking through the nature and possible uses of a new class of primary source.

“In April 2013 the six legal deposit libraries for the UK were granted powers to archive the whole of the UK web domain, in parallel with the historic right of legal deposit for print. As such, over time there will be a near-comprehensive archive of the UK web available for historical analysis, which will grow and grow in value as the span of time it covers lengthens. This paper introduces the JISC-funded AADDA (Analytical Access to the Domain Dark Archive) project. Led by the Institute of Historical Research (IHR) in partnership with the British Library and the University of Cambridge, AADDA seeks to demonstrate the value of longitudinal web archives by means of the JISC UK Web Domain Dataset. This dataset includes the holdings of the Internet Archive for the UK for the period 1996-2010, purchased by the JISC and placed in the care of the British Library. The project has brought together scholars from the humanities and social sciences in order to begin to imagine what scholarly enquiry with assets such as these would look like.”