What do we need to know about the archived web?

A theme that emerged for me at the IIPC web archiving conference in Reykjavik last week was metadata, and specifically: precisely which metadata do users of web archives need in order to understand the material they are using?

At one level, a precise answer to this will only come from sustained and detailed engagement with users themselves; research which I would very much hope that the IIPC would see as part of its role to stimulate, organise and indeed fund. But that takes time, and at present, most users understand the nature of the web archiving process only rather vaguely. As a result, I suspect that without the right kind of engagement, scholars are likely (as Matthew Weber noted) to default to ‘we need everything’, or if asked directly ‘what metadata do you need?’ may well answer ‘well, what do you have, and what would it tell me?’

During my own paper I referred to the issue, and was asked by a member of the audience if I could say what such enhanced metadata provision might look like. What I offer here is the first draft of an answer: a five-part scheme of kinds of metadata and documentation that may be needed (or at least, that I myself would need). I can hardly imagine this would meet every user requirement; but it’s a start.

1. Institutional
At the very broadest level, users need to know something of the history of the collecting organisation, and how web archiving has become part of its mission and purpose. I hope to provide an overview of aspects of this on a world scale in this forthcoming article on the recent history of web archiving.

2. Domain or broad crawl
Periodic archiving of a whole national domain under legal deposit provisions now offers the prospect of the kind of aggregate analysis that takes us well beyond single-resource views in Wayback. But it becomes absolutely vital to know certain things at a crawl level. How was territoriality determined – by ccTLD, domain registration, Geo-IP lookup, curatorial decision? The way the national web sphere is defined fundamentally shapes the way in which we can analyse it. How big was the crawl in relation to previous years? How many domains are new, and how many have disappeared? What’s the policy on robots.txt (by default)? How deep was the crawl scope (by default)? Was there a data cap per host? Some of this will already be articulated in internal documents, some will need some additional data analysis; but it all goes to the heart of how we might read the national web sphere as a whole.
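
Some of these crawl-level figures need little more than the list of domains captured in each crawl. What follows is a minimal sketch in Python, assuming two hypothetical lists of hostnames, one per crawl year; the function name crawl_churn and the sample data are illustrative only, and not drawn from any particular archive's tooling.

```python
def crawl_churn(previous_hosts, current_hosts):
    """Compare the hosts seen in two crawls of the same national domain."""
    # Normalise case and trailing dots so the same host is not counted twice.
    previous = {h.lower().rstrip(".") for h in previous_hosts}
    current = {h.lower().rstrip(".") for h in current_hosts}
    return {
        "new": sorted(current - previous),
        "disappeared": sorted(previous - current),
        "retained": sorted(previous & current),
        # Size of this crawl relative to the previous one, by host count.
        "growth_ratio": len(current) / len(previous) if previous else None,
    }


# Hypothetical sample data, for illustration only.
crawl_2015 = ["example.is", "news.example.is", "old-shop.example.is"]
crawl_2016 = ["example.is", "news.example.is", "blog.example.is", "museum.example.is"]

report = crawl_churn(crawl_2015, crawl_2016)
print(report["new"])
print(report["disappeared"])
```

A summary of this kind only counts hosts. It says nothing about why a host disappeared (a lapsed registration, a change of scope, a robots.txt exclusion), which is exactly where the accompanying documentation matters.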

3. Curated collection level
Many web archives have extensive curated collections on particular themes or events. These are a great means of showcasing the value of web archives to the public and to those who hold the purse strings. But if not transparently documented they present some difficulties to the user trying to interpret them, as the process introduces a level of human judgment to add to the more technical decisions that I outlined above. In order to evaluate the collection as a whole, scholars really do need to know the selection criteria, and at a more detailed level than is often provided right now. In particular, in cases where permissions were requested for sites but not received, being able to access the whole list of sites selected rather than just those that were successfully archived would help a great deal in understanding the way in which a collection was made.

4. Host/domain level
This is the level at which a great deal of effort is expended to create metadata that looks very much like a traditional catalogue record: subject keywords, free-text descriptions and the like. For me, it would be important to know when the first attempt to crawl a host was, and the most recent, and whether there were 404 responses received for crawl attempts at any time in between. Was this host capped (or uncapped) at the discretion of a curator, differently from the policy for the crawl as a whole? Similarly, was the crawl scoping different, or the policy on robots.txt? If the crawl incorporates a GeoIP check, what was the result? Which other domains has it redirected to, and which redirect to it, and at which times?

5. Individual resource level
Finally, there are some useful things to know about individual resources. As at the host level, information about the date of the first and last attempts to crawl, and about intervening 404s, would tell the user useful things about what we might call the career of a resource. If the resource changes, what is the profile of that: for instance, how has the file size changed over time? Were there other captures which were rejected, perhaps on a QA basis, and if so, when?
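
Part of such a "career" can be read from a CDX index. The sketch below is in Python and assumes the common 11-column CDX layout (header " CDX N b a m s k r M S V g"); the sample lines, digests and file names are invented for illustration. Two caveats: the size column in that layout is the compressed record size, which is only a proxy for file size, and whether 404 responses appear in the index at all depends on how a given archive stores and indexes its crawl. Captures rejected at QA would not normally be in a CDX, and would have to come from curator tools.

```python
from collections import defaultdict

# Assumed column order for the 11-field CDX layout "N b a m s k r M S V g".
FIELDS = ["urlkey", "timestamp", "original", "mimetype", "status",
          "digest", "redirect", "metatags", "size", "offset", "filename"]


def resource_careers(cdx_lines):
    """Summarise first/last capture, 404s and size profile per resource."""
    captures = defaultdict(list)
    for line in cdx_lines:
        parts = line.split()
        # Skip blank lines, the header line and anything not in this layout.
        if len(parts) != len(FIELDS) or parts[0] == "CDX":
            continue
        record = dict(zip(FIELDS, parts))
        captures[record["urlkey"]].append(record)

    careers = {}
    for key, records in captures.items():
        # 14-digit timestamps sort correctly as strings.
        records.sort(key=lambda r: r["timestamp"])
        careers[key] = {
            "first": records[0]["timestamp"],
            "last": records[-1]["timestamp"],
            "not_found": [r["timestamp"] for r in records if r["status"] == "404"],
            "sizes": [(r["timestamp"], int(r["size"]))
                      for r in records if r["size"].isdigit()],
        }
    return careers


# Invented sample lines, for illustration only.
sample = [
    " CDX N b a m s k r M S V g",
    "is,example)/ 20140301120000 http://example.is/ text/html 200 AAAA - - 5120 100 crawl-2014.warc.gz",
    "is,example)/ 20150301120000 http://example.is/ text/html 404 BBBB - - 512 200 crawl-2015.warc.gz",
    "is,example)/ 20160301120000 http://example.is/ text/html 200 CCCC - - 7300 300 crawl-2016.warc.gz",
]

careers = resource_careers(sample)
print(careers["is,example)/"])
```

The same grouping, keyed on host rather than on the full URL key, would give the first and last crawl dates at the host level.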

Much if not quite all of this could be based on data which is widely collected already (in policy documents, curator tools, crawl logs or CDX), or which could be collected with some adjustment. It presents some very significant GUI design challenges in how best to deliver these data to users. Some might be better delivered as datasets for download or via an API. What I hope to have provided, though, is a first sketch of an agenda for what the next generation of access services might disclose, that is not a default to ‘everything’ and is feasible given the tools in use.

4 thoughts on “What do we need to know about the archived web?”

  1. Pingback: What do we need to know about the archived Web? | Web Archives for Historians

  2. Dear Peter,

    This is fantastic! I’m working on a project within the scope of the Dutch Network Digital Heritage. The aim of the network is to share knowledge and technology to facilitate the preservation, usability and accessibility of digital heritage.

    My project [1] concerns embedding the thesaurus of the Netherlands Institute of Sound and Vision within the DAM of another archive, namely Groninger Archieven. The reason they want to use the subjects from the thesaurus is that they themselves don’t have a proper thesaurus to annotate their web archive. Furthermore, their web archive is a relatively new asset, and they needed a metadata model to have it land in their DAM.

    For our project, we’ve opted to use a selection of fields from MODS Lite, the model that we picked after comparing it with e.g. MARC and METS [2]. Right now, I’m evaluating the project and found your post while looking for insights and ideas. The links I shared are for now unfortunately only available in Dutch (so I hope Google Translate doesn’t completely garble the meaning), but if you’re interested, I’d be happy to have a telco in English, or even write a summarising blog post.

    I have also sent information about my project to the Web Archiving Metadata Working Group of OCLC [3]; I’m guessing you’re part of this or have heard of it?

    Just wanted to share this information with you, let me know if you have any specific questions or comments.

    All the best,
    Lotte

    [1] http://www.den.nl/project/613/Annotatie-Webarchief-Groningen-op-basis-van-GTAA-en-OpenSKOS
    [2] https://docs.google.com/document/d/1pte4xY4GgV25fraa_sIdGUgJM7QVksf7HtaOfT8EHqg/edit
    [3] http://www.oclc.org/research/themes/research-collections/wam.html

  3. Defining a basic set of metadata is indeed a very challenging task. Thanks a lot for sharing your draft.

    Some questions:
    • What is your definition of
    ○ Collection level?
    ○ Host/domain level?
    ○ Individual resource level?
    • Which metadata fields are relevant for all levels? Which only for 1 or 2? E.g. crawl scope, dates, seeds (successfully/ unsuccessfully) crawled, subjects/keywords, … could be interesting on all levels.

    Thanks, Els

    • Hi Els, many thanks for your comment.

      In talking about Collections, I have in mind the very many national library web archives that are around particular topics: there are several election collections, for instance, others on the Olympics. It’s less relevant for an archive whose remit is to collect a known set of sites. Host/domain for me means what is commonly thought of as the “website”: for a blog, myblog.wordpress.com, or leave.eu (to take an unfortunate but topical example.) The resource is whichever thing is meaningful in the particular context: one HTML page, a PDF, an image.
