http://peterwebster.me/2014/01/28/distant-reading-the-webarchive/

On the number of times each site links out: there are two ways into this. One is to look at *for how long* a linkage persists (which one can do from this data, but I haven’t yet). The second is to look at how many links there were at any one time from one host to another. These numbers are given in the data, but they would only be meaningful if you also knew how many pages the linking site had in total. That is rather more difficult, and involves some triangulation with other data that the BL provides.
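As a rough sketch of the first approach, one could collect the years in which each pair of hosts is recorded. This assumes rows shaped like `year | linking host | linked-to host | count`; the sample rows and the function name `link_years` are made up for illustration and are not part of the dataset or its tooling.

```python
from collections import defaultdict

# Hypothetical sample rows: year | linking host | linked-to host | link count
ROWS = """\
2006 | host1.co.uk | host2.co.uk | 3
2007 | host1.co.uk | host2.co.uk | 2
2008 | host1.co.uk | host2.co.uk | 1
2008 | host3.co.uk | host4.co.uk | 100
"""

def link_years(text):
    """Map each (source, target) pair to the sorted list of years it appears in."""
    years = defaultdict(set)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        year, source, target, _count = (field.strip() for field in line.split("|"))
        years[(source, target)].add(int(year))
    return {pair: sorted(seen) for pair, seen in years.items()}

for (source, target), seen in link_years(ROWS).items():
    print(source, "->", target, ":", seen[0], "to", seen[-1])
```

The first and last recorded years give only a lower bound on how long a linkage lasted, since a pair missing from a given year need not mean the link had gone.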

Specifically, resources found in 2009 and in 2010 that had not changed since the last time they were archived were not included in the index. As a result, the single resource in the archive which gives rise to:

2008 | host1.co.uk | host2.co.uk | 1

will not show in this data as

2009 | host1.co.uk | host2.co.uk | 1

if the resource had not changed.

Similarly, the difference between the following two statements may be accounted for by the same deduplication:

2008 | host3.co.uk | host4.co.uk | 100

2009 | host3.co.uk | host4.co.uk | 1

There may still be 100 linking resources in 2009, but 99 of them may be unchanged.
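One way to guard against misreading such drops is to flag them rather than interpret them. The sketch below is only an illustration under assumptions: it takes rows in the pipe-separated shape shown above, and the names `DEDUPLICATED_YEARS`, `parse_row` and `ambiguous_drops` are hypothetical. It marks every host pair whose count falls, or which disappears altogether, in 2009 or 2010.

```python
# Years in which unchanged resources were left out of the index.
DEDUPLICATED_YEARS = {2009, 2010}

def parse_row(line):
    """Split 'year | source | target | count' into typed fields."""
    year, source, target, count = (field.strip() for field in line.split("|"))
    return int(year), source, target, int(count)

def ambiguous_drops(lines):
    """Return (source, target, year) where the count fell, or the pair
    vanished, in a year affected by deduplication."""
    counts = {}
    for line in lines:
        year, source, target, count = parse_row(line)
        counts[(source, target, year)] = count
    flagged = []
    for (source, target, year), previous in sorted(counts.items()):
        following = year + 1
        if following in DEDUPLICATED_YEARS:
            # A missing row is treated as a count of zero.
            current = counts.get((source, target, following), 0)
            if current < previous:
                flagged.append((source, target, following))
    return flagged

rows = [
    "2008 | host1.co.uk | host2.co.uk | 1",
    "2008 | host3.co.uk | host4.co.uk | 100",
    "2009 | host3.co.uk | host4.co.uk | 1",
]
print(ambiguous_drops(rows))
```

A flagged pair is not evidence that links were removed; it only says that the data alone cannot distinguish removed links from unchanged, deduplicated resources.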
