A theme that emerged for me in the IIPC web archiving conference in Reykjavik last week was metadata, and specifically: precisely which metadata do users of web archives need in order to understand the material they are using?
At one level, a precise answer to this will only come from sustained and detailed engagement with users themselves; research which I would very much hope that the IIPC would see as part of its role to stimulate, organise and indeed fund. But that takes time, and at present, most users understand the nature of the web archiving process only rather vaguely. As a result, I suspect that without the right kind of engagement, scholars are likely (as Matthew Weber noted) to default to ‘we need everything’, or if asked directly ‘what metadata do you need?’ may well answer ‘well, what do you have, and what would it tell me?’
During my own paper I referred to the issue, and was asked by a member of the audience if I could say what such enhanced metadata provision might look like. What I offer here is the first draft of an answer: a five-part scheme of kinds of metadata and documentation that may be needed (or at least, that I myself would need). I could hardly imagine this would meet every user requirement; but it’s a start.
At the very broadest level, users need to know something of the history of the collecting organisation, and how web archiving has become part of its mission and purpose. I hope to provide a overview of aspects of this on a world scale in this forthcoming article on the recent history of web archiving.
2. Domain or broad crawl
Periodic archiving of a whole national domain under legal deposit provisions now offers the prospect of the kind of aggregate analysis that takes us way beyond single-resource views in Wayback. But it becomes absolutely vital to know certain things at a crawl level. How was territoriality determined – by ccTLD, domain registration, Geo-IP lookup, curatorial decision? The way the national web sphere is defined fundamentally shapes the way in which we can analyse it. How big was the crawl in relation to previous years? How many domains are new, and how many have disappeared? What’s the policy on robots.txt (by default) ? How deep was the crawl scope (by default)? Was there a data cap per host? Some of this will already be articulated in internal documents, some will need some additional data analysis; but it all goes to the heart of how we might read the national web sphere as a whole.
3. Curated collection level
Many web archives have extensive curated collections on particular themes or events. These are a great means of showcasing the value of web archives to the public and to those who hold the pursestrings. But if not transparently documented they present some difficulties to the user trying to interpret them, as the process introduced a level of human judgment to add to the more technical decisions that I outlined above. In order to evaluate the collection as a whole, scholars really do need to know the selection criteria, and at a more detailed level than is often provided right now. In particular, in cases where permissions were requested for sites but not received, being able to access the whole list of sites selected rather than just those that were successfully archived would help a great deal in understanding the way in which a collection was made.
4. Host/domain level
This is the level at which a great deal of effort is expended to create metadata that looks very much like a traditional catalogue record: subject keywords, free-text descriptions and the like. For me, it would be important to know when the first attempt to crawl a host was, and the most recent, and whether there were 404 responses received for crawl attempts at any time in between. Was this host capped (or uncapped) at the discretion of a curator differentially to the policy for a crawl as a whole? Similarly, was the crawl scoping different, or the policy on robots.txt? If the crawl incorporates a GeoIP check, what was the result? Which other domains has it redirected to, and which redirect to it, and which times?
5. Individual resource level
Finally, there are some useful things to know about individual resources. As at the host level, information about the date of the first and last attempts to crawl, and about intervening 404s, would tell the user useful things about what we might call the career of a resource. If the resource changes, what is the profile of that: for instance, how has the file size changed over time? Were there other captures which were rejected, perhaps on a QA basis, and if so, when?
Much if not quite all of this could be based on data which is widely collected already (in policy documents, or curator tools, crawl logs or CDX) or could be with some adjustment. It presents some very significant GUI design challenges in how best to deliver these data to users. Some might be better delivered as datasets for download or via an API. What I hope to have provided, though, is a first sketch of an agenda for what the next generation of access services might disclose, that is not a default to ‘everything’ and is feasible given the tools in use.