Our Linked Data work has thrown up a significant number of challenges around the consistency and structure of the source data from the Archives Hub, and nowhere more so than around identifiers for the archival resources, that is, the references used for the archives at all levels of description, be it collection, series, file or item.
Identifiers on the Hub
Identifiers serve two distinct purposes on the Archives Hub:
(i) the identifier for the archive itself – the reference for the actual collection or sub-collection. This is contained within the ‘unitid’ tag.
(ii) the identifier for the description of that archive – the finding aid. This is contained within the ‘eadid’ tag.
a) Identifiers for the Description of the Archive
The eadid tag consistently contains attributes for the country and the agency that maintains the description. This information is also given within the content of the tag. The Hub URI is created by converting the reference to lower case, and converting slashes to dashes:
<eadid countrycode=”GB” mainagencycode=”1234″ identifier=”JaB/A”>GB 1234 JaB/A</eadid>
b) Identifiers for the Archive
We display the identifier for the archive within the description. So, to a degree, the way this identifier is structured in the data is less important, as long as we display it to the researcher.
The identifier for the archive is typically the same as the identifier for the description, including a code for the country and a code for the repository as well as the local identifier for the archive, although they serve different purposes:
Reference: GB 1234 JaB/A
At the top level, things can seem relatively straightforward.But, bear in mind that on the Hub the primary role of the ‘unitid’ reference is to be a visual indicator of the reference – the important thing is what displays to the end-user, so a level of inconsistency in the make-up of the unitid might not be a problem as long as we display the correct reference.
If you look behind the scenes, there is a lack of consistency in the structure of these identifiers. The country code and repository code may exist within the content (which is displayed), or they may exist as ‘attributes’ – which provide additional information that is not part of the content (and which can be displayed, but may not be), or they may exist as both. Occasionally they are not present at all.
For those that are familiar with XML markup, I mean that we could display a reference such as ‘GB 0982 UWA’, but there are various ways the data may be structured:
(1) <unitid countrycode=”gb” repositorycode=”0982″ identifier=”UWA”>GB 0982 UWA</unitid>
(2) <unitid>GB 0982 UWA</unitid>
(3) <unitid countrycode=”gb” repositorycode=”0982″>UWA</unitid>
Even if you are not familiar with XML, you can see the way that the content is the same (apart from the last example) but the way it is structured differs. However, as long as we can display ‘GB 0982 UWA’ on the Archives Hub we are OK with this. We have ensured that our stylesheet copes with a number of different options, bearing in mind this is just for what displays through a Web browser.
c) Identifiers for the Archive at Lower Levels
On the Hub, lower levels are assigned persistent identifiers in a similar way to collections. A component’s identifier is that of its parent record (i.e. the content of the eadid tag), followed by a hyphen, then the unitid of the component.
There is no lower-level eadid – the only identifier at lower-levels is the unitid. So, we use the collection-level identifier for the description along with the lower-level identifier for the archive as the unique identifier for the lower level description (eadid + lower-level unitid):
<eadid countrycode=”GB” mainagencycode=”1234″ identifier=”JaB/A”>GB 1234 JaB/A</eadid>
<unitid> GB 1234 JaB/A/3/1</eadid>
Would have the URI of:
Maintaining the Distinction Between Identifiers
The Hub has, in a sense, tended to conflate the identifier for the archive with the identifier for the finding aid that describes the archive. Not in the sense of what their function is within the Hub, but more in the way that we took the decision to recommend that these two identifiers should be the same, which makes good sense most of the time. This means that it can be harder to convey the different purposes of the two identifiers. For most of the time this is not a problem, but I think many archivists do think that they are creating one identifier, and don’t think about the distinction between identifying an archive and the description of an archive.
Creating Linked Data Identifiers
Within our Linked Data the identifiers for the descriptions are the Archives Hub URI’s, so that we link back into the Hub from the Linked Data. The challenges are around the URIs for the archive collection or sub-collection.
Initially we used the eadid content in the URI for the archive collection identifier, so for example:
eadid reference = GB 1234 JaB
URI = http://data.archiveshub.ac.uk/id/archivalresource/gb1234jab.
The ‘JAB’ may also be the identifier for the archive, but in this case it is being used as a unique identifier for the description.
The reason that we used the eadid was because it is definitely unique. The Hub requires all eadid’s to be unique. However, there are two main issues with this:
1) The eadid is the identifier for the description, not the archive collection.
2) Sometimes the agency that maintains the description is not the same as the agency that holds the archive. This is reflected in a different code used in the eadid (to reflect the agency that created the description) and the unitid (to show the repository that holds the archive).
eadid = GB 133 PPL – an archive maintained by ’133′, which is the John Rylands Library
unitid= GB 135 PPL – an archive stored at ’135′, The Methodist Archive (part of the Library, but a separate entity)
In this case, we have to maintain the difference between the eadid and unitid because they are telling us different things.
It was for these reason that we felt we should create the URI for the archive from the identifier for the archive, which is the content of the ‘unitid’ tag.
URIs for the Archive at Collection Level
At the top level, things can seem relatively straightforward. Examples of unitid:
GB 1086 Skinner
GB 0532 cwlmga
These are neat examples, and we can translate these into nice Linked Data URIs for the archival resource:
URIs for the Archive at Lower Levels
For lower-levels the unitid entries can be quite complicated, although they should work fine if the country code and repository code are included in some way:
GB 2008 TAS1/1/1
But on the Hub, as I have said, we combine the eadid for the top level with this lower level unitid, for reasons to do with trying to ensure the reference is unique and ensuring that the country code and repository code are incorporated, so
eadid: GB 2008 TaS
unitid: GB 2008 TAS1/1/1
eadid: GB 2008 TaS
Problems with Using the Unitid for the Identifier
1) Attributes for the Country and Repository
The fact that the unitid doesn’t always contain the country code and repository code, or they may be present as attributes or as content, or both, is problematic for Linked Data, where the identifier for the archive needs a unique URI and the attributes help to provide this.
2) Maintaining the Distinction between Identifiers
As set out above, the Hub has recommended that contributors use the same identifier for the finding aid as the identifier that describes the archive (unless the agency is different). The different function of these two identifiers is still preserved within the Hub, but using the same data for them works perfectly well.
However, we should not actually assume that the eadid’s mainagency code and the unitid’s repository code are the same. This is because the code for the eadid agency refers to the agency responsible for the description, and the code for the unitid refers to the repository responsible for the archive. They are usually the same, but they are not necessarily the same. If we want to make statements about our content, such as ‘this archive is held at this repository’ and ‘this person was responsible for creating this description’ then the distinction becomes important as they may be different agencies.
The agencies can also change over time. So, if you take the example of the Papers of Dr Thomas Coke. These papers could be held at the Methodist Archive (code 135) and described by a minimal EAD doc maintained by the Methodist Archive so eadid/mainagencycode = unitid/repositorycode = 135
But then at some point maybe John Rylands University Archive revises the description and extends it (maybe the first one was just collection-level and now it is multi-level). So eadid/mainagencycode = 133 and unitid/repositorycode = 135
The archival stuff hasn’t changed, and it is still held/curated by Methodist Archive, so the URIs of the archival stuff shouldn’t change, even where the description is attributed to a different repository. This means that we shouldn’t rely on eadid content/attributes in constructing URIs for the holding repository.
This whole situation can be complicated by the fact that sometimes the unitid does not contain the repository code at all, so the only code we have is from the eadid, and we have to assume they are the same agency.
3) Inconsistencies in the Data
As stated above, the unitid does not always contain attributes, and so entries vary quite widely. This is mainly as a result of data coming into the Hub from many different sources over a period of time. Many descriptions were created in other systems, and it is always a challenge to move data between systems and end up with something consistent and fit for purpose. Many descriptions were created in something like Word originally, and so issues such as unique identifiers for URIs were not in the game plan at the time. In general, the eadid entries are more consistent and easier to work with than the unitid entries for multi-level descriptions.
4) Unitid may not be Unique
We have hit problems with the untid not being unique throughout the Hub, mainly for lower-level descriptions, and this is the most significant problem. For the Hub, the only identifier that has to be unique is the identifier for the description, the eadid, because this is what the Hub works with – the Hub essentially works with the description.
Process to Identify Duplicate Unitid’s
Pete created an analysis of unitid content and attributes using XSLT, a nifty piece of work that allowed me to see exactly where the duplicate identifiers are. We found that in general duplicates apply to both the raw EAD and the identifiers created by the Archives Hub. But the transformation process, whereby the Hub converts to lower case and uses ‘-’ instead of ‘/’, can create a duplicate where it did not exist in the raw EAD, such as for StoneR and Stoner (both become ‘stoner’), or for MS/1 and MS-1 (both become ms-1).
With this spreadsheet I was able to order the entries by ‘clashes’ within the identifiers and then decide how to tackle them.
The ‘Stoner/StoneR’ problem is tricky. One way to get round it would be to go back to URIs that are case sensitive, but this does seem like a retrograde step. Another option would be to work with the contributors to avoid using case as a means to distinguish references, and I think this is what we will do. I think they will be happy to work with us on this, so that we can stick to lower-case URIs but avoid duplication.
Distinguishing Repository Code and Reference
There is an issue with identifiers where the local reference starts with a number, e.g. http://archiveshub.ac.uk/data/gb1067esw. It looks like the repository code is ’1067′ but actually it is ’106′ and the reference is ’7ESW’. This could potentially be a problem if the repository codes of ’106′ and ’1067′ were both used on the Hub.
We are starting to think about converting to URIs with hyphens to show the three parts of the reference, e.g. http://archiveshub.ac.uk/data/gb-106-7esw.
We had to deal with filtering out ‘former reference’ which many contributors include as a second unitid entry, particularly where the descriptions come from other systems. This is usually a relatively straightforward process, although slightly complicated by the fact that in a few instances the attribute value ‘former reference’ isn’t used. E.g:
<unitid label=”former reference”>http://archiveshub.ac.uk/data/gb123abc</unitid><unitid>Former reference: http://archiveshub.ac.uk/data/gb123abc</unitid>
The latter example can create problems for us. However, as with so many things, there is one further issue here: some contributors actually want the former reference (what they might call ‘alternative reference’) to be the current reference – in this case we are stumped! The only option would be to edit the Hub descriptions prior to creating the Linked Data.
Changes to References
In the end, even once we face all of the other issues with identifiers, we also have the risk that an archive repository will choose to change its reference. This is not common, but it does happen. We could probably find a way to ‘archive’ the old reference and have some kind of link between the two. The horror scenario would be if the repository then used the old reference, which to them may now be redundant, for a new accession.
Reasons for Duplication
The unitid is usually duplicated due to human error, rather than as part of the system of cataloguing. It isn’t surprising when you consider the level of detail and length of some descriptions, and that they are created with various software applications, some of which don’t really help with creating good identifiers. You find entries with references such as: JAB/1/2(i)/2/2, JAB/1/2(i)/2/3, JAB/1/2(i)/3, JAB/1/2(i)/5. Once you identify where the mistake is, you can simply correct it.
In one interesting example for a large collection, the duplicate identifier was purposely used, because the higher level entry described a person and the entry below that described the ‘stuff’ about them. So, something like ABC/JB for an entry about John Bunn in the ABC collection, and ABC/JB for a further entry for a file of stuff about John Bunn. You can see how in the display this works OK, although it means each entry has the same URI, but in terms of Linked Data it is a problem.
Our plan at present is, as far as possible, to correct the data, rather than try to work around it. This means some ‘persistent’ URIs will change, but it seems worth the end result of ensuring unique identifiers. We will have to run the analysis on all of our data in order to pick up any issues that need addressing. This has the advantage that we can also pick up other problems with the identifiers, such as the odd incorrect repository code, or attributes absent from eadid entries.
It has been crucial for Pete, who has done the processing, to really understand how URIs are constructed, and in doing this it has become apparent how our system has been skilfully developed by John Harrison, our developer, to cope with many variations, as you have to with a large aggregator. It would have been extremely difficult to ensure a level of rigour in the data initially, when we were taking in such a diversity of content and focussing on building up the service. In addition, we could not plan for ‘cool URIs’ or, of course, Linked Data.
For Linked Data, rigour is important because you are drawing out so many different entities within the world you are describing, and they all need unique and dereferenceable http URIs. The question is, do you put the effort into introducing more rigour into your source data – is it worth the investment? It is certainly often time-consuming and can be quite a difficult process with so many variables to think about. I think that it generally is worth doing, because more consistent data with properly constructed identifiers must be a good thing, not just for Linked Data but for the whole open and interoperable agenda.