With various Linked Data projects now under way in archives, I thought the time seemed right to get together and share what we have done, what we think and what we see as the challenges around this work. The Locah project team were particularly keen to find out what the AIM25 team had been up to with their Open Metadata Pathway and Step Change projects, but it seemed well worth broadening the invite, so I put out an invitation to other projects and through the archives email list. In the end we had several projects represented:
Locah / Linking Lives: a project to create Linked Data for EAD archive descriptions on the Archives Hub. We have provided a stylesheet that converts EAD to RDF/XML as well as a SPARQL endpoint and Linked Data views of our data. We did a significant amount of work around data modelling, the use of RDF vocabularies and creating external links, and we blogged in detail about the processes and issues involved in creating Linked Data from complex hierarchical archive descriptions. We are now working on an interface to show how the Linked Data can be used to bring different resources together.
Open Metadata Pathway / Step Change: work around the use of OpenCalais and the UKAT thesaurus (subjects and names) to extract entities from data and enable URIs to be created. A tool is being developed to allow archivists to do this at the time of cataloguing and the project is working with Axiell CALM to embed this into the CALM software and display via CALMView.
SALDA: A project to output the Mass Observation archive as Linked Data and enhance the data. This built on the Locah work.
Bricolage: will publish catalogue metadata as Linked Open Data for two of its most significant collections: the Penguin Archive, a comprehensive collection of the publisher’s papers and books; and the Geology Museum. It intends to build on previous work done by Locah and SALDA.
Trenches to Triples: will provide Linked Data markup to both collection level descriptions and item level catalogue entries relating to the First World War from the Liddell Hart Centre for Military Archives. It will also provide a demonstrator for using Linked Data to make appropriate connections between image databases, Serving Soldier, and detailed catalogues.
The majority of those attending represented these projects. There was also representation from the DEEP project to digitise English place names and make them available as structured data. Other attendees represented the museum sector and The National Archives. In addition, we brought developers together with archivists and managers, and I think we managed to strike a good balance in our discussions so that they were of benefit to everyone.
In the morning we shared information about our projects. This gave us a chance to ask questions and get a clearer understanding of what people have been doing in this space. A number of issues were presented and discussed.
Extracting concepts from data
We were given a demonstration of the prototype cataloguing interface developed by the OMP project and now being developed under Step Change. It uses OpenCalais to extract concepts from archive descriptions, which tend to be quite document-centric, and contain large chunks of text, particularly in the biographical history and the scope and content sections. The idea is to provide URIs for these concepts, so, for example, OpenCalais highlights a name within a paragraph of text, such as ‘Architectural Association’, and you can then confirm that this is an organisation and it is relevant as an index term so that it is marked up appropriately. The tool is being tested with archivists because ease of use is going to be key to its success. We discussed the limitations of the OpenCalais vocabulary – it does really clever data analysis, but it isn’t geared up for historical data sources. UKAT is much broader and more suitable for archive descriptions – it would be good to integrate this vocabulary into OpenCalais.
Solving the challenges of archive descriptions
We discussed some of the challenges that Locah has faced with processing multi-level archive descriptions, such as: duplicate identifiers for different resources (especially within the same description – more to come on this issue on the Linking Lives blog); creating URIs for data such as extent of archive (where you may have ‘10 boxes’, but you may also have ‘2 boxes, 10 large photographs and a reel of film’); inheritance of data from the top level down through the levels of a description (which is problematic within Linked Data); and matching names on the Archives Hub to names on VIAF (which we’ve had reasonable success in doing, though in archives names can be quite problematic, such as ‘John Davis, fl. 1880-1890’).
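To make the extent problem a little more concrete, here is a minimal sketch of how a free-text extent statement might be broken into quantity and unit pairs before any URIs are created. This is not what Locah actually does; the function name `parse_extent` and the small `NUMBER_WORDS` table are assumptions made purely for illustration, and real extent statements are far messier than this handles.

```python
import re

# Hypothetical lookup for quantities written as words; a real table would be larger.
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3}


def parse_extent(statement):
    """Split a free-text extent statement into (quantity, unit) pairs.

    Components with no recognisable quantity are returned as (None, text)
    so that they can be flagged for manual review.
    """
    # Components are separated by commas or by the word 'and'.
    parts = re.split(r",|\band\b", statement)
    components = []
    for part in parts:
        words = part.split()
        if not words:
            continue
        first = words[0].lower()
        if first.isdigit():
            quantity = int(first)
        elif first in NUMBER_WORDS:
            quantity = NUMBER_WORDS[first]
        else:
            # No leading quantity: keep the text as-is rather than guess.
            components.append((None, part.strip()))
            continue
        components.append((quantity, " ".join(words[1:])))
    return components


simple = parse_extent("10 boxes")
mixed = parse_extent("2 boxes, 10 large photographs and a reel of film")
```

The simple case yields a single pair, while the mixed statement yields three separate components, each of which would need its own treatment in the data model.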
Working with one dataset versus working with a large aggregation
We thought about the comparison between creating a Linked Data output for the Archives Hub, which aggregates data from hundreds of archives, and creating it for just the Mass Observation Archive. Whilst the scale you are working with is appealing with a large aggregator (potential to create Linked Data for all these repositories), working with a discrete collection gives you more control to be able to interpret things within the data (for example, the date may always appear in a certain place, so you can confidently mark up the date as a date and give it a URI).
This led us into some discussion around the way that creating Linked Data can really highlight problems within the data source, and it may provide impetus to address these problems, thus improving the source data by making it more consistent or clarifying the meaning of elements within the data.
Integrating Linked Data into the Workflow
The Step Change project is particularly focussed on the challenge of making this kind of semantic markup easy to achieve. It has to be well-received by cataloguers. Work is being undertaken at Cumbria Record Office to test the tool out and provide feedback. We discussed the importance of major players such as Axiell CALM embracing this kind of approach, enabling Linked Data to be created from CALM descriptions. This is not yet happening, but the Step Change project is working with CALM and so it is a good starting point. We also discussed the need for the CALM user group to think about whether they want this kind of Linked Data output from their software provider (it needs to be demand-led).
The ‘Same As’ Issue
We touched on the issues around trusted data a number of times. The SALDA project found that creating ‘same-as’ links was probably the most challenging part of the project. We agreed that we must be aware of the importance of archive descriptions being trusted sources and there has been a tendency for some data providers to use ‘same-as’ links too promiscuously. In a Linked Data context this is problematic, as you are then asserting that all of the statements made by the data source you are linking to through this relationship are true. It raises the issue of manual matching as a means to be sure your links are semantically correct, but doing this is time-consuming, so it can only be carried out in a minority of cases.
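A small sketch may help show why an over-eager ‘same-as’ link matters. It uses plain Python tuples rather than an RDF library, and every identifier in it (`hub:john-davis`, `other:john-davis`, `ex:flourished`, `ex:birthYear`) is made up for illustration. The function simply gathers every statement that applies to a resource once ‘same-as’ links are followed.

```python
SAME_AS = "owl:sameAs"


def statements_about(subject, triples):
    """Return every (predicate, object) that holds for `subject`
    once same-as links are followed in both directions."""
    # Grow the set of identifiers asserted to denote the same thing.
    equivalent = {subject}
    changed = True
    while changed:
        changed = False
        for s, p, o in triples:
            if p != SAME_AS:
                continue
            if s in equivalent and o not in equivalent:
                equivalent.add(o)
                changed = True
            elif o in equivalent and s not in equivalent:
                equivalent.add(s)
                changed = True
    # Every statement about any equivalent identifier now applies.
    return {(p, o) for s, p, o in triples if s in equivalent and p != SAME_AS}


# Hypothetical data: an archival name record linked to an external record.
triples = [
    ("hub:john-davis", "ex:flourished", "1880-1890"),
    ("hub:john-davis", SAME_AS, "other:john-davis"),
    ("other:john-davis", "ex:birthYear", "1901"),
]

inherited = statements_about("hub:john-davis", triples)
```

In this made-up case the archival record ends up asserting both that John Davis flourished in 1880-1890 and that he was born in 1901, which is exactly the kind of contradiction a wrong match brings into a trusted description.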
* * *
In the afternoon we had two sessions, (i) techniques and tools, and (ii) opportunities and barriers. A brief summary of some of the points that were made during the discussions:
Benefits of Linked Data
- The principle of generic APIs – a standard way to provide access to data – it could replace the myriad of bespoke APIs now available.
- Dataset integration – bringing data sources together.
- The precision of information retrieval and giving researchers more potential to ask specific questions of data sources. For example, a researcher may be able to retrieve information around a specific event.
- Embedding the expertise of practitioners was seen as something that should be a benefit, i.e. we should ensure that this happens.
- It encourages cross-domain working.
- It enables people to create their own interfaces and tools to utilise data sources.
- It encourages the creation of narratives through diverse data sources.
- It is very much an ‘anti-silo’ approach to data.
Barriers to Linked Data
- Expertise required.
- Need to clearly show the extra value it brings (e.g. above what Google offers).
- Need clearer understanding of end-user benefits.
- Sustainability and persistence (we talked about the idea of a ‘URI checker’ and using caching to help with this).
- Possible overload caused by large-scale searches or ‘bad’ SPARQL queries.
- Licensing, including restrictions on external data that you might want to link to within your data.
- Choice of so many vocabularies.
- Likelihood of not following the same kinds of practice, thus impeding the linking possibilities between datasets.
The group felt that a recommended approach for archive descriptions would be really useful, to facilitate others outputting Linked Data and ensure we do get the benefits that Linked Data potentially offers.
We talked about a generic stylesheet. The community has already benefitted from the data model and stylesheet developed by the Locah project, with AIM25’s OMP project and SALDA both using it, and Bricolage looking at it for their Linked Data work. There are still issues around the diverse nature of the data, however: a stylesheet to transform EAD descriptions to RDF/XML may be a great start for many projects, but modifications are almost inevitable, and these require expertise.
We did decide that a possible way forwards would be what one attendee called a ‘lego approach’, where we think in terms of building blocks towards Linked Data. The idea would be to work on discrete parts of a data model and to recommend best practice. For example, one area would be the relationship between the resource and the index terms or access points. Another would be the relationship between the resource and the holding institution.
This approach should be cross-domain, bringing archives together with museums and libraries. We could look at parts of the model in turn and decide what the relationships are, whether they are consistent across the domains and which vocabularies would be appropriate. The idea would be to end up with a number of ‘RDF fragments’ that people could use, but with the flexibility to extend the model to meet their requirements.
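As a rough illustration of what such fragments might look like in practice, the sketch below treats each relationship as a small function that returns triples, which can then be combined or extended. Nothing here was agreed at the meeting: the example.org URIs and the `heldBy` property are hypothetical, and `dcterms:subject` is used only as one plausible vocabulary choice for index terms.

```python
# Dublin Core Terms property for subjects; one possible choice for index terms.
DCTERMS_SUBJECT = "http://purl.org/dc/terms/subject"
# Hypothetical property for the holding institution, invented for this sketch.
HELD_BY = "http://example.org/vocab/heldBy"


def index_term_fragment(resource, concept):
    """Fragment: relationship between a resource and an index term."""
    return [(resource, DCTERMS_SUBJECT, concept)]


def holding_institution_fragment(resource, institution):
    """Fragment: relationship between a resource and its holding institution."""
    return [(resource, HELD_BY, institution)]


def to_ntriples(triples):
    """Serialise URI-only triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Hypothetical identifiers; a project would substitute its own URIs.
resource = "http://example.org/id/archive/1"
description = (
    index_term_fragment(resource, "http://example.org/id/concept/geology")
    + holding_institution_fragment(resource, "http://example.org/id/repository/1")
)
print(to_ntriples(description))
```

A project could pick up only the fragments it needs and add its own alongside them, which is the flexibility the ‘lego approach’ is meant to give.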
We are hoping to discuss this proposal more and think about what would be required in order to achieve this kind of co-ordinated approach. Obviously we would immediately need to get buy-in across the three domains. Our meeting represented archives, but this approach would require a very collaborative effort. However, the advantages are very clear, and it does seem to strike a balance between the challenges of a completely interoperable solution and the disadvantages of each domain working out different models and using different vocabularies.