Linking Lives Evaluation Report

This blog post is based upon a report written by colleagues at Mimas*, presenting the results of the evaluation of our innovative Linked Data interface, Linking Lives. The evaluation consisted of a survey and a focus group, with 10 participants including PhD and MA students studying history, politics and social sciences.

This blog post concentrates on their responses relating to Linking Lives. We also asked about their use of the Archives Hub, their methods of searching and their interpretation of results. You can read more about their responses to these questions on our Archives Hub blog.

Mock-up of the Linking Lives interface: this shows the interface as it was presented to the participants in the focus group, with the results of a search for ‘Beatrice Webb’.

Evaluating the response to the Linking Lives concept is essential in order to be able to answer the crucial questions around the value of a Linked Data approach for researchers. Obviously Linking Lives is just one interface, but it is based upon the principle that lies at the heart of Linked Data: bringing together diverse data sources in order to allow researchers to make new connections. We did not have a live demo to provide to the participants in the focus group, so we used a number of mock-ups to show what was intended. (The site is currently still in development).

Provenance and Quality

Participants in the focus group were clear that provenance was vital to them. They wanted to know where the data had come from. There were concerns about including data from Wikipedia:

“Wikipedia is not considered a good source, so it needs to be clear where the information is coming from.” (Survey respondent)

There was a feeling that Wikipedia is not credible as an academic source. In the Linking Lives interface we have only included the image and the place of birth and death from Wikipedia, rather than any descriptive information. It could be argued that even this is not properly verified, but on the other hand, it could equally be argued that verification through hundreds of people, effectively providing a comprehensive data checking service, could be more accurate than one cataloguer creating a description. A lone cataloguer might be more likely to make a mistake. In addition, a page about a person on Wikipedia may benefit from the expertise of the crowd whereas a cataloguer is not likely to be expert on all the archives they catalogue. This is a very fundamental and broad issue around the integrity, accuracy and trustworthiness of resources, and the Linked Data approach does require us to think more carefully about the issues here, because of the intention to bring sources together. We know that a very large number of linked data sources are using Wikipedia as a hub to link to because of its profile and the links out that it provides. The BBC are including descriptive data from Wikipedia on their pages – but maybe there is a case for using it on a more populist resource rather than a resource that is aimed at academic research? In addition, the BBC invite people to participate by updating the Wikipedia article, and this kind of participatory approach may work better in some situations than in others.

Selection of Data

Participants wanted to know about the choices underlying Linking Lives: why is the data chosen? What gets left out? It is interesting how researchers respond when you explicitly show them a resource that brings sources together and ask them to think about the pros and cons. I wonder how often they think about ‘the data that gets left out’ when using other resources where they are not explicitly thinking about this as an issue.

Maybe it could be argued that keeping sources separate has advantages, in the sense that the researcher then uses one source at a time and is more likely to know what that source is, who created it and what it covers. We have found with the Archives Hub (which just searches across archives) that researchers want a clearer idea of what is covered; they don’t always understand the results they see, or why they get certain results in response to their searches. I can’t help thinking that, bearing this in mind, bringing diverse sources together may make it more difficult for users to understand and interpret results.

A Biographical Interface

Overall, participants liked the concept of basing Linking Lives around people as a way of “getting a good overview”, and preferred this to an interface based around concepts:

“I’d be a little bit sceptical if you were to extend it to concepts…it might tend to homogenise and evacuate some of the complexities and subtle nuances of particular theories.” (Focus group participant)

But they remained cautious about the principle of bringing sources together, and there was a feeling that portals like this don’t always do the job very effectively. When asked about the benefits of serendipitous searching, there was a feeling that it could potentially be useful but also that it could actually distract the researcher from what is relevant.

Participants were in no doubt that the breadth and completeness of the service would be key to its value. So, for example, if Linking Lives includes a list of works by a person, can the researcher trust that the list is complete? If not then its utility is significantly diminished. Maybe there is an issue here for a Linked Data approach; if you are drawing in data from other sources, you might select those sources on the basis of quality, but you would not be responsible for what they provide. So, for Linking Lives, we would not be able to guarantee that a list of works is comprehensive, although we might choose to take the list from a trusted source such as the British Library.

The benefit of Linked Data is that you can draw in a diverse set of sources, and the aim is to provide a well-rounded view, but the more sources you pull into a single interface, the more you have to consider how to present them clearly: showing that they are distinct sources, and conveying to the researcher that they are not under your control. There are certainly issues around expectations and understanding here that need further exploration.

From reading the Evaluation report and thinking a little more about the issues, I wonder whether a front-end designed to enable researchers to utilise sources brought together through a Linked Data approach should focus more on building an appropriate search mechanism. One option would be for researchers to select sources to search from a list so that they are in control of the sources they search. The nature of each source could be explained at this point. For example, a researcher might choose to bring together The Archives Hub, the British National Biography, the BBC and the British Museum (each choice of data set would affect other choices they could make based upon which sources are linked together). When they click to select each of these sources a short summary tells them what each resource offers. The researcher could then go on to search within these sources, and when the results are presented, they already have a reasonable sense of what the data represents.

Linking Lives Audience

The participants generally felt that something like Linking Lives would be more appropriate for undergraduates, or useful for teaching, but it would not enable the more sophisticated searching that PhD students might want to carry out, maybe based on a more contextual approach:

“I think at PhD level there’s a kind of artistry to how you make your way through…I’ve certainly never come across a search engine that can do the same or be as complex as your own thinking patterns.” (Focus group participant)

There was a feeling that having a group of separate archives brought together that relate to one person would be useful for teaching and helping undergraduates to understand more about how an archive works. However, expectations would need to be managed because it might encourage students to think that the archives are more readily accessible than they often are.

In Conclusion

The power of Linked Data to connect diverse sources also seems to raise one of the main challenges. If you want to provide a user-friendly interface, and enable researchers to search across particularly diverse sources (e.g. archival data, census data, climatic data) to make unlikely connections, then there is a challenge around how the data is presented so that a researcher can interpret it. Maybe this will inevitably involve the researcher in more complex searches and interpretation of results, but the reward could potentially be high. For example, a researcher might be able to discover correlations between weather patterns and social behaviour over time because the required datasets have been linked together. If Linked Data enables researchers to (reasonably) easily draw quite different sources together then it would offer something potentially very valuable.

In the end, there is still a challenge for Linked Data to make a sound business case and really showcase the end-user benefits. Certainly there is a strong case in favour of the idea of a Web of Data; a Web that is about data and not about documents; something that enables researchers to navigate across data sources rather than jumping from one silo of data to another, but maybe there is too little focus on how researchers will actually achieve this in a way that works for them – how we can present Linked Data to them in a way that really answers their research questions. One of the respondents in the survey said that it sounded confusing when the principles of Linked Data were explained and this may present its own challenge – explaining what Linked Data actually is and how it works. As Joy Palmer states in her commentary on the Evaluation Report:

“Whilst it could be said that it is not important for users to understand how data is pulled together under the hood, our research suggested that potential users, particularly advanced researchers, do indeed have an interest in how and why this information has been gathered together in a particular way. [To] what extent is it possible, or even desirable, to explain the mechanics of Linked Data, and does an understanding of how Linked Data works represent an advanced aspect of information or digital literacy?”

Maybe we’ve reached the point in the Linked Data story where we need to focus more strongly on how it will answer the requirements of researchers. Maybe we need to find better ways to explain Linked Data to them and the vision that goes along with it. Surely we need a more collaborative approach that draws in the technical people, the information professionals and the researchers.

* Evaluation Report by Lisa Charnock, Frank Manista, Janine Rigby, Joy Palmer

Posted in barriers, benefits, evaluation, researchers

Data Publication and Linked Data in the Humanities

Linking Lives, in the guise of myself and Jane, has been invited to speak at the ‘Data Publication and Linked Data in the Humanities’ event coming up on November 12th at the National Library of Wales. There’s no charge to attend, so why not come along if you can make it?

We’ll say where we’ve got to with the project and we’ll be talking about how we’ve been addressing our “major challenges”, essentially what the day is about. The workshop is co-organised by the National Library of Wales and King’s College, London, and funded by JISC. It will “investigate how linked data could serve the digital arts and humanities by bringing together international experts in the semantic web to discuss existing approaches in the digital arts and humanities“.

More information and sign-up details on the event web page.

Posted in events

NISO Information Standards Quarterly on Linked Data

NISO ISQ: http://www.niso.org/publications/isq/

The NISO publication, Information Standards Quarterly, Spring/Summer 2012, is dedicated to Linked Data in Libraries, Archives and Museums. It includes an article on the Linking Lives project. The article can be found at http://www.niso.org/publications/isq/2012/v24no2-3/stevenson/

It includes sections on: the background to the project; interface design; the challenges of the source data; working with external datasets; a technical perspective on data collection; problems of identity.

The issue also includes articles on schema.org, Europeana and LODLAM.

Posted in archival context, archival description, barriers, identifiers, interface, linked data

The Winner Takes it All? – APIs and Linked Data Battle It Out

Myself and Jane spoke about the Linking Lives project at the EMTACL ‘Emerging Technologies in Academic Libraries’ conference in Trondheim, Norway a few days ago. We were co-presenting ‘The Winner Takes it All? – APIs and Linked Data Battle It Out’, contrasting the Linking Lives Linked Data approach with the lightweight APIs approach of another project I’m working on, the WW1 Discover project.

I’ve blogged on the WW1 project blog about how we got on.

Posted in Uncategorized

URIs, identity, aliases & “consolidation”

Jane has written a few posts recently on our efforts to improve the stability of URIs used for pages about archival resources on the “live” Archives Hub service, and as far as possible we’ll be trying to reflect the changes made there in the URIs we use in the Linked Archives Hub RDF data. Much of that work has led to a review of the conventions used in the source EAD XML data and a concerted effort to “cleanse” or enhance that data to improve its coherence and consistency.

In this post, I’ll focus on some issues around the URIs used to identify Persons in the Linked Archives Hub data. It’s something I’ve been trying to write on and off over a period of several weeks, and a combination of some work I’ve been doing for the Bricolage project, and some subsequent conversations, have prompted me to try to knock my rather rambling drafts into shape.

The Data Transformation Process

It’s probably worth taking a step back and emphasising that the process by which the Linked Archives Hub RDF data is generated is currently a relatively simple one:

  • EAD XML documents are transformed into RDF/XML using an XSLT transform. This process is performed on a “document-by-document” basis, i.e. it has as input a single EAD document and an XSLT stylesheet and outputs a single RDF/XML document; the process does not have any “knowledge” of the other EAD documents within the dataset to be transformed.
  • The output from the transform is uploaded to a triple store.
  • Some supplementary data is uploaded alongside the data derived from the EAD documents. This data is the product of various processes: some is “hand-crafted”; some is imported from external sources; some is the result of processes run over the EAD-derived data; some is the result of “lookups” against external datasets – but for the purposes of this discussion, the key point to note is that it is “added to” the EAD-derived data, and that EAD-derived data itself is not changed.
  • That data is served as “Linked Data” “bounded descriptions”.

The URIs used to identify persons in the Linked Archives Hub dataset have their origins in the names of persons occurring in the Archives Hub EAD XML documents. Within those documents, person names occur in two contexts (or at least the EAD-to-RDF transformation process currently takes into account occurrences of names in two contexts). I’ll describe here how the conversion process handles this data, what RDF data is generated and then look at some of the issues this raises.

The examples I’ll use are all from the small subset of EAD documents included in the current Linked Archives Hub data. I’ve picked the case of Beatrice Webb, which illustrates several of the variations which can occur and the issues which arise.

Personal names as index terms

The first context is that of personal names added to the description by the cataloguer as “index terms” on the basis that they may be useful for the purposes of retrieval/search/browse. In the Hub EAD documents, they occur in XML structures like the following, using the EAD controlaccess element. In its simplest form, this looks like:

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
    </did>
    ...
    <controlaccess>
      <persname source="nra">(name)</persname>
    </controlaccess>
  </archdesc>
</ead>

In some (but not all) of the Hub EAD documents, a convention employing the emph element and emph/@altrender attribute is used to capture the distinction between the component parts of a name constructed according to a name rules system – this is something local/”proprietary” to the Hub application (and really a “redefinition” of the EAD tag semantics): a “standard EAD” application would not interpret the markup in this way.

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
    </did>
    ...
    <controlaccess>
      <persname source="nra">
        <emph altrender="surname">Webb</emph>
        <emph altrender="forename">Martha Beatrice</emph>
        <emph altrender="dates">1858-1943</emph>
        <emph altrender="epithet">Social Reformer</emph>
      </persname>
    </controlaccess>
  </archdesc>
</ead>

Within the subset of the Hub EAD data currently transformed into RDF, this same “index term” – same XML fragment – is used in three different EAD XML documents:

In this example, the persname/@source attribute is used to capture the name of a “name authority file” from which the name is drawn, the “nra” value here indicating the use of the National Register of Archives (NRA). The NRA itself is not currently available as Linked Data, so does not provide URIs for the entities described. The NRA record for Beatrice Webb is http://www.nationalarchives.gov.uk/nra/searches/subjectView.asp?ID=P29999. In fact, the actual form of the name used in the authority record (“Webb, Martha Beatrice (1858-1943) nee Potter, Social Reformer”) does appear to differ slightly from that used in these three EAD documents (i.e. it includes “nee Potter”).

As I discussed on the LOCAH project blog, in our mapping of the EAD data into an RDF representation, from this XML structure we generate two resources to try to capture the distinction between the person and the “conceptualisation” of that person reflected in the authority file entry or the use of the name rules. The two resources have distinct URIs and are linked using the foaf:focus property.

The patterns for the URIs for both the concept and the person are similar, and based on a combination of:

  • the name of the authority file or (see below) of the name rules
  • a “slug” derived from the name itself (including life dates, titles, epithets etc)

So for the cases above the Person URI generated is:

  • http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://data.archiveshub.ac.uk/id/concept/person/nra/webbmarthabeatrice1858-1943socialreformer>
  a skos:Concept ;
  rdfs:label "Webb, Martha Beatrice, 1858-1943, social reformer" ;
  foaf:focus 
    <http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer> .

<http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer>
  a foaf:Person ;
  rdfs:label "Webb, Martha Beatrice, 1858-1943, social reformer" ;
  foaf:name "Martha Beatrice Webb" ;
  foaf:familyName "Webb" ;
  foaf:givenName "Martha Beatrice" .

In other cases, the persname/@source attribute is not present, but instead the persname/@rules attribute is used to provide the name of a set of “name rules” under which the name is constructed. The example below refers to the use of “ncarules”, i.e. the National Council on Archives’ Rules for the Construction of Personal, Place and Corporate Names.

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
    </did>
    ...
    <controlaccess>
      <persname rules="ncarules">
        <emph altrender="a">Webb</emph>,
        <emph altrender="forename">Martha Beatrice</emph>
        <emph altrender="dates">1858-1943</emph>
        <emph altrender="other">nee Potter</emph>
        <emph altrender="epithet">social reformer and historian</emph>
      </persname>
    </controlaccess>
  </archdesc>
</ead>

This form is present in seven EAD documents:

and is mapped by the transform to the URI

  • http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943neepottersocialreformerandhistorian

A second form of the name, also constructed using NCA Rules, but with a variation in the epithet and the “nee Potter” omitted, is also used:

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
    </did>
    ...
    <controlaccess>
      <persname rules="ncarules">
        <emph altrender="a">Webb</emph>,
        <emph altrender="forename">Martha Beatrice</emph>. (
        <emph altrender="y">1858-1943</emph>)
        <emph altrender="epithet">social reformer</emph>
      </persname>
    </controlaccess>
  </archdesc>
</ead>

This appears in one EAD document:

and is mapped to the URI:

  • http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943socialreformer

There are a few points worth noting here:

First, and most obviously (and this was the point that initially prompted me to start writing this post), the fact that different forms of name can – quite legitimately, within the constraints of the EAD format and the Hub data entry guidelines – be used as index terms to refer to the same person across the dataset means that we end up generating through our transform process – and publishing/exposing to the Web in our data – multiple URIs for the same person. From the cases above, we have three distinct “URI aliases” for Beatrice Webb:

  • http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer
  • http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943socialreformer
  • http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943neepottersocialreformerandhistorian

Second, the use of the name to construct the URI is not a guarantee of avoiding URI ambiguity (i.e. of having a single URI used to refer to what are in fact two different things). In archival description data it is quite common to encounter names without complete life dates or epithets, and in a dataset the size of the Hub, it is quite possible that there are two occurrences of an index term like “Smith, John, 1945-, engineer”, both constructed using the same “name rules”, which are intended as references to two distinct individuals but would be mapped to the same URI.

Third, the “repeatability” of the transformation process over time is not guaranteed. If any of the name components changes in the EAD document (e.g. a previously unknown date of death is added, or an “epithet” is added or removed), then the subsequent re-transformation of the data will generate a different URI from that generated from the previous process using the initial form of the name. (Is “Scott, James, 1950-2012, biologist” in this version the same person who was referred to as “Scott, James, 1950-, scientist” in a previous version?)

Fourth, for both URIs, that of the Concept and of the Person, the URI includes the name of the “authority file” or name rule system.

I’m willing to concede that for the Person case this may be “overkill”. I think I chose this because I was wary of conflating what were in reality two different persons based on matches in their names. So, on this basis, it should not be automatically assumed that the same form of name in two different authority files refers to the same person, at least not without some human verification – though having said that, if there is a match on “life dates” and “epithets”, then it seems highly probable that they do.

Similarly with the name rule systems case. The situation here is probably even more complex, as in archival description data it is quite common to encounter names without complete life dates or epithets. I also wondered if it was theoretically possible that under two different name rule systems, different surname/forename ordering rules might result in two quite different names mapping to the same string in the URI. e.g. forename = James and surname = Scott under a surname first rule would result in “scottjames….” and forename = Scott and surname = James under a forename first rule would also result in “scottjames….”.

So, in short, retaining the name of the name rules or the authority file as part of the Person URI was part of an attempt to avoid accidentally conflating what may be two different persons, i.e. to reduce instances of the second problem above, though this very tactic potentially contributes to the first one!

Personal names as names of the creators/”originators” of archival resources

The second context in which personal names are found is as the names of agents responsible for the creation or bringing together of the resources described. In the Hub EAD documents, they occur in XML structures like:

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
      <origination>(name)</origination>
    </did>
    ...
  </archdesc>
</ead>

In the Hub EAD data, there is no guarantee that the data indicates whether the name is that of a person or an organisation. Although the EAD schema does support the use of the <persname> and <corpname> elements within the <origination> element, and indeed they are present in some Hub data, the Hub data entry tool does not provide this distinction.

While cataloguers are encouraged to provide the name of the originator also as an index term, this guideline is not always followed.

Furthermore, the Hub data entry guidelines for this element encourage the use of “the commonly used form of name”, so it may be that the form of name used here is different from that used as an “index term”, which creates potential complexity in trying to “reconcile” the two.

Beatrice Webb appears as the creator/originator of five collections:

using one of the following two XML structures:

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
      <origination encodinganalog="3.2.1">
      Webb, (Martha) Beatrice, 1858-1943, wife of 1st Baron Passfield, social reformer and historian
      </origination>
    </did>
    ...
  </archdesc>
</ead>
<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
      <origination encodinganalog="3.2.1">
      Webb, Martha Beatrice, 1858-1943, wife of 1st Baron Passfield, social reformer and historian
      </origination>
    </did>
    ...
  </archdesc>
</ead>

The name-to-URI mapping algorithm discards the parentheses so both cases map to a single URI:

  • http://data.archiveshub.ac.uk/id/agent/gb97/webbmarthabeatrice1858-1943wifeof1stbaronpassfieldsocialreformerandhistorian
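For illustration, here is a minimal Python sketch of the sort of normalisation just described – lower-casing the name and discarding everything except letters, digits and the hyphen in the life dates. The real transform is an XSLT stylesheet and its exact rules may differ, so treat this as an approximation rather than a description of the actual code:

import re

def person_slug(name):
    """Approximate the name-to-slug mapping: lower-case the name and keep
    only letters, digits and hyphens (so spaces, commas, full stops and
    parentheses are all discarded)."""
    return re.sub(r"[^a-z0-9-]", "", name.lower())

person_slug("Webb, (Martha) Beatrice, 1858-1943, wife of 1st Baron Passfield, social reformer and historian")
# -> 'webbmarthabeatrice1858-1943wifeof1stbaronpassfieldsocialreformerandhistorian'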

Post-transform processing

After this EAD-derived data is uploaded to the triple store, some further processes are applied:

  • a “lookup” process which extracts information about “persons” in the Hub data and searches for candidate matches in the VIAF dataset
  • a process which seeks candidate matches within the Hub dataset between “agents” (generated from the creator/origination context) and “persons” (generated from the index terms context)

The result of this is the addition of a set of triples with owl:sameAs predicates to indicate that the various data.archiveshub.ac.uk URIs (and the VIAF URI) identify the same person.
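To make that concrete, here is a small sketch (in Python with rdflib, purely for illustration – this is not the code we actually run) of the shape of the owl:sameAs statements that might be added for the Beatrice Webb case. Which URI ends up as the subject of the assertions is an assumption on my part; the URIs themselves are the ones discussed above, plus the VIAF URI found by the lookup process.

from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()

# The data.archiveshub.ac.uk URIs for Beatrice Webb discussed in this post,
# plus the VIAF URI identified by the lookup against VIAF.
nra = URIRef("http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer")
aliases = [
    URIRef("http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943socialreformer"),
    URIRef("http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943neepottersocialreformerandhistorian"),
    URIRef("http://data.archiveshub.ac.uk/id/agent/gb97/webbmarthabeatrice1858-1943wifeof1stbaronpassfieldsocialreformerandhistorian"),
    URIRef("http://viaf.org/viaf/86607236"),
]

# Assert that each alias identifies the same person.
for alias in aliases:
    g.add((nra, OWL.sameAs, alias))

print(g.serialize(format="turtle"))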

One of the problems with this approach is that an application consuming the data still has to be prepared to work with these multiple URI aliases, and particularly with SPARQL, this can be quite cumbersome: given URI X denoting a person, to find all the data we hold about the person, an application has to search for patterns involving not just that known URI X but also any URI Y, where URI Y sameAs URI X.
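As a rough sketch of what that means in practice, the query below starts from the one URI the application knows and follows owl:sameAs links in both directions (using a SPARQL 1.1 property path; a set of UNIONs would be the equivalent without property paths) before asking for statements. The Python/rdflib wrapping and the input file name are purely illustrative assumptions:

from rdflib import Graph

g = Graph()
g.parse("hub-person-data.ttl")  # hypothetical file of person data plus sameAs triples

# Gather statements about the known URI *and* any of its sameAs aliases.
query = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?alias ?p ?o WHERE {
  <http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer>
      (owl:sameAs|^owl:sameAs)* ?alias .
  ?alias ?p ?o .
}
"""
for alias, p, o in g.query(query):
    print(alias, p, o)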

Materialising inferences?

As a possible further measure to mitigate these difficulties, we might perhaps take the approach of further “materialising inferences” based on these owl:sameAs predicates, i.e. explicitly adding to the data the further set of triples which can be inferred from those triples. While this would facilitate querying, it increases the size of the dataset and also (from a “provenance” perspective) adds to the complexity of managing how we distinguish the different sources of data (e.g. which triples had their origin in the transformation of the source EAD documents and which were added by subsequent processes).
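A naive sketch of what such a materialisation step might look like (Python/rdflib again, for illustration only): it copies statements across each explicit sameAs pair, in the subject position only, and does not compute the full closure; the input file name is hypothetical.

from rdflib import Graph
from rdflib.namespace import OWL

g = Graph()
g.parse("hub-person-data.ttl")  # hypothetical input file

# For each owl:sameAs pair, copy statements made about one URI onto the other.
inferred = []
for a, _, b in g.triples((None, OWL.sameAs, None)):
    for s, p, o in g:
        if p == OWL.sameAs:
            continue
        if s == a:
            inferred.append((b, p, o))
        elif s == b:
            inferred.append((a, p, o))

for triple in inferred:
    g.add(triple)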

Consolidation and “Annotation”

I’m coming to the conclusion that while our current process is “OK-ish” as a first stab at generating an RDF representation, the “repeatability” issue (a change in name resulting in a change of URI) is a problem, and the multiple URI aliases in the published data are, while not strictly “wrong”, at best rather “sub-optimal” for consumers of the data.

The “repeatability” problem is the consequence of our basing the “slug” in the Person URI pattern on data attributes that can change over time. At the time the transform is applied, the only data that is available is the name (and the associated attributes), so I’m not sure I have a good answer to this. One approach would be to see the transformation stage as only the first part of a larger process, to keep track of the URIs generated over time, and build in a stage of processing to reconcile the URI generated from “Scott, James, 1950-2012, Sir, biologist” this week with the URI generated from “Scott, James, 1950-, scientist” in the previous version of the document six months ago. This perhaps then becomes simply a special case of the second problem, of dealing with multiple URIs for a single entity.

On the second problem, given the nature of our input data, it may well be a necessary part of the process that the initial transformation stage does result in multiple URIs. But once we’ve applied the post-transform processing to “reconcile” these references, rather than publishing a set of sameAs triples, maybe we should take a step further and consider “consolidating” our data to use a single URI for the person?

So e.g., if our post-transform processing tells us that, as I describe above, we have four distinct data.archiveshub.ac.uk URIs which all refer to the person Beatrice Webb, should we “distill down” to one of those four, and replace the occurrences of the other URIs in the data?
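The consolidation step itself is not hard to sketch. In the illustrative Python/rdflib fragment below, the choice of which URI to keep (here, the NRA-derived one) and the input file name are my assumptions; deciding which URI should survive is of course the harder editorial question.

from rdflib import Graph, URIRef

g = Graph()
g.parse("hub-person-data.ttl")  # hypothetical input file

# The URI we have decided to keep, and the aliases to be replaced.
keep = URIRef("http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer")
aliases = {
    URIRef("http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943socialreformer"),
    URIRef("http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943neepottersocialreformerandhistorian"),
    URIRef("http://data.archiveshub.ac.uk/id/agent/gb97/webbmarthabeatrice1858-1943wifeof1stbaronpassfieldsocialreformerandhistorian"),
}

# Rewrite every occurrence of an alias, in subject or object position, to the kept URI.
for s, p, o in list(g):
    new_s = keep if s in aliases else s
    new_o = keep if o in aliases else o
    if (new_s, new_o) != (s, o):
        g.remove((s, p, o))
        g.add((new_s, p, new_o))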

Furthermore, if we know that the content of any name is potentially unstable (i.e. “Scott, James, 1950-, scientist” can be replaced by “Scott, James, 1950-2012, Sir, biologist”), should we be using this as the basis of a URI at all, even in the case where – at this point in time – it is the only name for that person in our dataset? Should we instead manage a mapping to some sort of code and use that to construct a distinct URI again? The challenge is in creating a process/workflow which makes this easy to do, and repeatable if/when data is reprocessed or new data is added.

A further possibility is suggested by a post by Leigh Dodds which I’ve had at the back of my mind for a while, and which he mentions again in a more recent post.

Leigh argues that as Linked Data providers we tend to publish data using our own URIs, then “reconcile” some of those URIs with some existing published URIs for the same entity created by other providers, and add owl:sameAs assertions to indicate that they are co-references – much as I’ve described here for the Linked Archives Hub case. But an alternative approach in which, instead of publishing our own URIs, we use those existing URIs directly in our own data may well make our data easier to use. Leigh refers to this as “Annotated Data” – in the sense that we are providing new triples using an existing URI. Applying this to our concrete example for Beatrice Webb, if, as I suggest above, it would be a Good Thing to “distill down” our four different URIs for Beatrice to a single URI and substitute that single URI in our data, could we use, say, VIAF’s URI for her for that purpose?

In fact, we already make use of externally-owned URIs directly for the case of languages, where we simply use lexvo.org URIs directly in our data. One motivation for choosing this approach was that it was trivial to construct the lexvo.org URIs in the transform process using the language codes present in the EAD data. Obtaining a VIAF URI for a person, on the other hand, is a rather more complex task involving a search of another dataset and (in some cases, at least), a process of manual verification of candidate matches. But in spite of the difference in the processes of obtaining the URIs, are the two cases so distinct? Particularly if we start to think of our data publication as rather more of a multiple-stage process, I admit I’m less sure than I might have been at one point.

One factor might be our level of confidence in the stability of any external URIs we use. I’m not sure VIAF has published any formal policy regarding its URIs. But on the other hand, part of the problem that we are grappling with is that of maintaining the stability of our own URIs!

Another factor is that the consequence of adopting the “annotation” approach is that when it comes to dereferencing URIs, we would no longer have a data.archiveshub.ac.uk “Person URI” which we can redirect to a document/graph that we serve. Obviously, the VIAF URI for Beatrice Webb redirects to a document served by VIAF – which would not provide the information that, say, she was the creator/originator of the five archives above or the “foaf:focus” of those concepts associated with the eleven other archives. That information is still present in our dataset, and would be available via SPARQL, and as part of other Linked Data documents we do serve (e.g. the bounded description of the archival resource would include a triple indicating its creator/originator). In principle, we could also, as Leigh suggests in the penultimate paragraph of his post, continue to serve a document providing a bounded description, much as we do now, but its subject would be <http://viaf.org/viaf/86607236> rather than a data.archiveshub.ac.uk URI. The challenge then becomes one of how to make that document discoverable (through foaf:isPrimaryTopicOf/wdrs:describedby/rdfs:seeAlso links? through third-party services built on such links?)

I admit I hesitate to advocate taking this plunge at this point. The cases of the Language URIs and the Person URIs do seem to be different – although in ways I’m not sure I can articulate very clearly! Using the lexvo.org Language URIs seems appropriate in part because it doesn’t seem like we have “anything interesting to say” about languages, but the person case feels more “core” to “our data”. Also we will almost certainly always have to handle cases for which VIAF doesn’t provide a URI and we need our own Person URI. On the other hand, if, say, the National Register of Archives “authority file” data had already existed as Linked Data, and provided URIs for persons, would we still coin our own URIs for those cases? Or would we have simply adopted their URIs wherever we could? I’d hope we’d have chosen the latter. Maybe we really do need to become more relaxed about embracing the use of others’ URIs.

So… I think we need to think more about whether to take that step of using external URIs instead of our own, but I do think our URI alias issues in general need some attention, probably involving some sort of an extension to the current process to introduce a “consolidation” step between the transformation stage and the publishing stage so that where we know we are coining multiple URIs for an entity, we publish only one of them.

Posted in archival description, identifiers, linked data

GLAM Rocks! – Libraries, Media & The Semantic Web hosted by the BBC

I had the very great pleasure of speaking at the ‘Libraries, Media & The Semantic Web’ event hosted by the BBC Academy last Wednesday, along with folks from the New York Times, the BBC, Google in the guise of Schema.org, Historypin and KONA. The event was organised by the Lotico London Semantic Web Group.

The General Manager for News and Media at the BBC, Phil Fearnley, introduced the event, and immediately caught our attention by informing us that the BBC is continuing to make a substantial commitment to semantic web technologies, having devoted 20% of its entire digital budget to activities underpinned by this technology. Nice one Phil.

After a few opening words from Marco Neuman of Lotico, Jon Voss was then up, giving us a briefing on Linked Open Data in Libraries, Archives and Museums (LOD-LAM) efforts around the world, and upcoming plans within the community. He talked about how the first International LODLAM Summit held in San Francisco last year has galvanised the LODLAM community, and helped kick-off a number of activities. Jon was the main convener of the summit, and kindly asked me to be on the organising committee, so, although you could say I’m biased, I can vouch for the fact that it was a great event. He also mentioned how the number of LODLAM events across the world has grown, with meetups in Australia, the UK and a number of places around the USA. Jon also talked about some recent work Historypin are doing to allow users to dig deeper into archival records based on time and place, to enhance the Historypin experience using linked data principles. He wrapped up by emphasising the importance of open licenses, and how open data has to come before linked open data.

I was up next, giving a whistle stop tour of UK LODLAM activities, myself being Adrian Stevenson, Senior Technical Innovations Coordinator at Mimas, University of Manchester. Given that I was in the vicinity of where the classic glam rock bands have played, I couldn’t resist the temptation to use the galleries, libraries, archives and museums ‘GLAM’ acronym for my presentation title, and throw in a glitter platform shoe on the opening slide. I covered the work of the LOCAH and Linking Lives projects, before giving a heads up to a number of the JISC funded Discovery projects doing linked data work, including the Bricolage project in which our own Pete Johnston is involved, and the newish World War One Discovery project I’m working on. I finished up by focussing on particular challenges we’ve met on Locah and Linking Lives, namely the difficulty of creating links based around names, and the general problem of finding data to link to.

We then moved to the media perspective on things, with Evan Sandhaus, lead architect for semantic platforms at the New York Times, giving us the low down on rNews, an embedded data standard for the news industry from the IPTC. Evan explained the ‘silly’ situation we’ve ended up in, where the data content of news articles is kept in structured form behind the scenes in databases, but this structure is lost when the data is presented to the Web in HTML. To address this weakness, the IPTC came up with the rNews data standard, which is defined as “a data model for embedding machine-readable publishing metadata in web documents and a set of suggested implementations”. Currently there are RDFa and HTML5 implementations, with a JSON implementation under consideration.

Addressing the benefits, Evan explained that rNews can provide superior algorithmically generated links, such as those generated by Google Rich Snippets, thereby improving referral traffic. In addition, it can allow for better analytics provided by the better quality data. It was noted, however, that these benefits will depend on the wide adoption of rNews in the community. He then gave a short history of the development of rNews, culminating in the announcement that it has now been adopted by the New York Times, and is used on all news articles published after 29th January 2012. Evan mentioned how the arrival of Schema.org, which essentially does the same thing as rNews, caused something of an “existential crisis”. Fortunately, the organisations have worked together, and schema.org has now been expanded to absorb about 98% of the rNews data model.

Dan Brickley from Google, working on Schema.org, gave a really interesting talk looking back at the history of search and structured data over the past 100 years. He used this as a way to highlight the connections between the GLAM sector, the media, and the problems schema.org is aiming to solve. Dan proposed the notion that somewhere in Belgium, semantic search over structured data went mainstream as long ago as 1912. He backed this up by quoting some search queries logged in the 1912 annual report from the Belgian Institute of Bibliography. Dan went on to talk about Lonclass, a BBC media archives classification system still used today. Dan suggests that Lonclass is based on structured semantic data, having compositional semantics predating computing. Using Lonclass, it’s possible to build sentences from its semantics, e.g. the Lonclass code ‘656.881:301.162.721’ for “Letters of apology” can be combined with the codes for ‘resignation letters’, ‘Margaret Thatcher’ etc.

Dan described how Schema.org, launched in June 2011, is essentially the result of a loose collaboration of engineering groups from Google, Bing, Yahoo & Yandex. Having been somewhat behind the scenes, they are moving increasingly to a collaboration model in the public space, the vocabulary development now being hosted by the W3C. Google Rich Snippets was cited as the best known way in which this markup is being used, and the business story is that if you use schema.org markup, your page is better described, you get more click-throughs, and people can better understand search result lists. He noted there’s also an advertising aspect, though this is not part of Dan’s work. The overarching aim is to give more accurate search results. Dan reckons schema.org counts as linked data, as the markup that describes someone, say Douglas Adams, points off to another page providing more info about Douglas Adams. Dan rounded off suggesting Schema.org is basically a dictionary of terms drawing on the everyday scenarios of search. It was interesting to note that he thinks the semantic web world is too polite in feeling the need to use other people’s terms. Schema.org is relatively ‘rude’, having about 300 terms, but he believes this makes it easier to deploy.

Silver Oliver from BBC News and Knowledge outlined how they’ve been doing ‘more of the same’, building on the semantic web work used for the World Cup and applying it to the new sports site, and the upcoming 2012 Olympics site. There’ll be representations for every athlete, medal event, venue, and so on. The underlying linked data principles are the same, i.e. tagging with HTTP URIs that are then used as hooks into the web graph. They’ll be using geonames for locations, hooked onto IOC Olympic content, which typically comes in spreadsheet form. They use Google Refine with the DERI RDF plugin to get RDF from spreadsheets, then add in other existing BBC RDF content, stitching these datasets together to create useful graphs. This approach gives the benefit of providing ‘page furniture’, for example, using information on the country Jamaica, and the IOC statistics on Jamaica’s performance in Olympics, to frame and enhance the BBC content on Jamaican athletes.

Silver mentioned that Google is their biggest data consumer, using their microdata and RDFa. He noted that the 2012 Olympics pages will have schema.org data in, and also mentioned work using hRecipe for exposing structured recipe information: these have surfaced really well on Google.

Yves Raimond from BBC R&D then talked about the challenge of surfacing the huge amount of excellent BBC archive content, and the challenge of making it connect with current content. The BBC has a massive archive, but tagging has only been used for a few years, and much of the archive has only very sparse and often incorrect metadata. He described how they’ve been using automated tagging with linked data URIs to make connections to current content to help push the archive to users. They’ve been trialling the approach on the World Service archive, which contains a massive audio database. They’re using a piece of software they’ve developed called ‘KiWi’, built with open source components, and some custom-built algorithms to automatically tag content. CMU Sphinx is used to create ‘very noisy’ speech-to-text transcripts. More will be published on how they’re using KiWi in the next few months. Yves then showed us examples of autotagged programme content. As he noted, it appears to do a decent job, but some of the tags are wrong. He mentioned the possibility of using crowdsourced tags to improve the accuracy of the content.

That was basically it for the presentations part of the proceedings. All the speakers then came up for a short Q&A session, mainly focussed on the media side of things, and after this we headed to the nearest bar. All in all it was a great evening, and I felt quite privileged to be part of a panel of such esteemed experts.

I’ve included the speakers’ slides, where I’ve been able to track them down, below:

Posted in archival context, linked data, open data

From EAD to Linked Data: Talk at UCL

Last Friday, at the invitation of Jenny Bunn, I visited UCL to talk to some of her postgrad students on the MA course in Archives and Record Management about Linked Data in general and the experiences of the LOCAH and Linking Lives projects in particular. I don’t think I really covered anything that we haven’t mentioned already here or on the LOCAH blog, but it gave me an opportunity to combine some general “tutorial”-ish background material with a few thoughts on some of those aspects of archival description and EAD that at times make the process of generating RDF “challenging”, and I thought I’d share the slides here (PDF).

Posted in archival description, linked data

Do not underestimate cleaning your data!

In Linked Open Data: The Essentials (Bauer, Kaltenbock), the first steps given for publishing your content as LOD are:

1. Analyse your data

2. Clean your data

3. Model your data

…and it goes on to very helpfully summarise the further steps required. The steps given are typical of the advice often given about how to create Linked Data.

Under ‘Clean your data’ it states:

Data and information that comes from many distributed data sources and in several different formats (e.g. databases, XML, CSV, Geodata, etc.) require additional effort to ensure easy and efficient modelling. This includes ridding your data and information of any additional information that will not be included in your published data sets.

In retrospect, I greatly underestimated this particular step. Format is fine as far as we are concerned, but our data does come from many data sources – from over 200 sources in fact. I’m not sure about ridding the data of additional information, but for us issues around data consistency have created a very significant amount of extra work; work that I did not properly factor into the process.

Before I say any more about this, I want to make one thing clear: in talking about inconsistency and ‘errors’ in the data, I am not wanting to criticise the Archives Hub contributors at all. For a start, much of the data in the Hub was created over many years, and much has been migrated from many different systems. Secondly, we were simply not thinking in a Linked Data way 5 or 10 years ago. We didn’t necessarily prioritise consistency in instances where it now becomes much much more important. We didn’t ask for things that we now ask for, or ensure checks were made for certain data. We had other priorities, and the challenge of just creating an aggregator from scratch was pretty huge.

In Linked Data, you are bringing all (or many) of the entities within the data to the fore. In a way, it’s as if they can’t hide anymore; they can’t just sit within the context of the collection description and display themselves to users through a Web interface. They have to work a bit harder than that because essentially they all become lead players. And it feels to me as if this is what really makes the quality of the data so important.

I have recently blogged about the issue we have had with identifiers. This is probably the biggest issue we have to deal with. But others have come up. For example, some of our descriptions have ‘Reference’, as you would expect, but they also have ‘Former Reference’ (both in the same tag of ‘unitid’). The problem with this is that it is not always encoded consistently, so then it becomes hard to say ‘where X is included do Y’.

Another example is where we have two or more creators for a description. Up until now, we have simply had one field for contributors to add ‘name of creator’ (the EAD ‘origination’ tag), but that means that two or more names simply go into the same field and are not made distinct in a way that a machine can process. It’s fine for display. A human knows that Dr James Laidlaw Maxwell, Dr James Preston Maxwell means two people. But it is harder for a machine to distinguish names if there isn’t a consistent separator. In Linked Data terms it may mean that you end up with a person effectively identified as ‘drjameslaidlawmaxwell,drjamesprestonmaxwell’. (The comma may seem like a reasonable separator, but commas often exist within names, as they can be inverted, and other entries don’t use a comma.)

During our Linked Data work, what we have done when we find a data issue is to make a decision whether the issue will be dealt with through the conversion process or dealt with at source. In general, I think it’s worth dealing with issues at source, because it tends to mean the quality and consistency of the data (thinking particularly in terms of markup) is improved.

Furthermore, this emphasis on the data has led us to think quite fundamentally about many aspects of our data structure, the ways that we ask people to create descriptions and how we can improve our ‘EAD Editor’ in order to ensure more consistency – not just from a Linked Data perspective. It has contributed to a decision to make this kind of data editing more rigorous and better documented. It has also made us think about how to convey what is good practice more effectively, bearing in mind that many people don’t have much of a sense of what might be needed for Linked Data.

However, the other side of the coin is the realisation that you cannot clean your data perfectly. We have over 25,000 collection descriptions and many hundreds of thousands of lower-level entries. It is likely that we will have to live with a certain level of variation, because some cleaning up would be very hard to do other than manually. Our data will always come from a variety of sources, and it may actually be that our move towards importing data from other systems introduces more variation. For example, I recently found that a number of descriptions from one contributor, exported from another system, did not provide the ‘creator’ entry as a structured access point (index term). This is a problem from a Linked Data perspective, where you are trying to uniquely identify individuals and match that name to other instances of the same person.

Data cleaning can sometimes feel like a can of worms, and I warn those with similar aggregated data, or data from different sources, that dealing with this can really start to eat away at your time! I would certainly advise starting off by thinking about workflow for data cleaning – the reporting, decision making, documenting, addressing, testing, signing-off – whatever you need to do. In retrospect I would have started a spreadsheet straight off. But, overall I think that it has been good for us to think more carefully about our data standards and how we can improve consistency. I feel that it’s something we should address, whether or not Linked Data is involved, because it increases the potential of the data, e.g. for creating visualisations, and it generally makes it more interoperable.

Posted in barriers, data cleaning, data processing, identifiers, linked data

Unique Identifiers for Archives in a Linked Data World

Our Linked Data work has thrown up a significant number of challenges around the consistency and structure of the source data from the Archives Hub, and nowhere more so than around identifiers for the archival resources, that is, the references used for the archives at all levels of description, be it collection, series, file or item.

Identifiers on the Hub

Identifiers serve two distinct purposes on the Archives Hub:

(i) the identifier for the archive itself – the reference for the actual collection or sub-collection. This is contained within the ‘unitid’ tag.

(ii) the identifier for the description of that archive – the finding aid. This is contained within the ‘eadid’ tag.

a) Identifiers for the Description of the Archive

The eadid tag consistently contains attributes for the country and the agency that maintains the description. This information is also given within the content of the tag. The Hub URI is created by converting the reference to lower case, removing the spaces and converting slashes to dashes:

<eadid countrycode="GB" mainagencycode="1234" identifier="JaB/A">GB 1234 JaB/A</eadid>

becomes

http://archiveshub.ac.uk/data/gb1234jab-a
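In code terms the rule is trivial; a minimal sketch (the function name is mine and purely illustrative, not anything from the Hub codebase):

def hub_description_uri(eadid):
    """Convert an eadid reference to a Hub description URI: lower-case it,
    drop the spaces and turn slashes into dashes."""
    slug = eadid.lower().replace(" ", "").replace("/", "-")
    return "http://archiveshub.ac.uk/data/" + slug

hub_description_uri("GB 1234 JaB/A")
# -> 'http://archiveshub.ac.uk/data/gb1234jab-a'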

b) Identifiers for the Archive

We display the identifier for the archive within the description. So, to a degree, the way this identifier is structured in the data is less important, as long as we display it to the researcher.

The identifier for the archive is typically the same as the identifier for the description, including a code for the country and a code for the repository as well as the local identifier for the archive, although they serve different purposes:

Reference: GB 1234 JaB/A

At the top level, things can seem relatively straightforward. But, bear in mind that on the Hub the primary role of the ‘unitid’ reference is to be a visual indicator of the reference – the important thing is what displays to the end-user, so a level of inconsistency in the make-up of the unitid might not be a problem as long as we display the correct reference.

If you look behind the scenes, there is a lack of consistency in the structure of these  identifiers. The country code and repository code may exist within the content (which is displayed), or they may exist as ‘attributes’ – which provide additional information that is not part of the content (and which can be displayed, but may not be), or they may exist as both. Occasionally they are not present at all.

For those that are familiar with XML markup, I mean that we could display a reference such as ‘GB 0982 UWA’, but there are various ways the data may be structured:

(1) <unitid countrycode="gb" repositorycode="0982" identifier="UWA">GB 0982 UWA</unitid>

(2) <unitid>GB 0982 UWA</unitid>

(3) <unitid countrycode="gb" repositorycode="0982">UWA</unitid>

(4) <unitid>UWA</unitid>

Even if you are not familiar with XML, you can see the way that the content is the same (apart from the last example) but the way it is structured differs. However, as long as we can display ‘GB 0982 UWA’ on the Archives Hub we are OK with this. We have ensured that our stylesheet copes with a number of different options, bearing in mind this is just for what displays through a Web browser.

c) Identifiers for the Archive at Lower Levels

On the Hub,  lower levels are assigned persistent identifiers in a similar way to collections. A component’s identifier is that of its parent record (i.e. the content of the eadid tag), followed by a hyphen, then the unitid of the component.

There is no lower-level eadid – the only identifier at lower-levels is the unitid. So, we use the collection-level identifier for the description along with the lower-level identifier for the archive as the unique identifier for the lower level description (eadid + lower-level unitid):

<eadid countrycode="GB" mainagencycode="1234" identifier="JaB/A">GB 1234 JaB/A</eadid>

and lower-level

<unitid>GB 1234 JaB/A/3/1</unitid>

would have the URI:

http://archiveshub.ac.uk/data/gb1234jab-a-gb1234jab-a-3-1
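
Again purely as an illustrative sketch (with the same assumptions as before about how references are normalised):

def normalise(ref):
    return ref.replace(" ", "").lower().replace("/", "-")

def lower_level_uri(eadid, unitid, base="http://archiveshub.ac.uk/data/"):
    """Hub URI for a lower-level description: eadid + '-' + lower-level unitid."""
    return base + normalise(eadid) + "-" + normalise(unitid)

print(lower_level_uri("GB 1234 JaB/A", "GB 1234 JaB/A/3/1"))
# prints: http://archiveshub.ac.uk/data/gb1234jab-a-gb1234jab-a-3-1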

Maintaining the Distinction Between Identifiers

The Hub has, in a sense, tended to conflate the identifier for the archive with the identifier for the finding aid that describes it – not in terms of their function within the Hub, but in that we took the decision to recommend that the two identifiers should be the same, which makes good sense most of the time. This can make it harder to convey their different purposes. Most of the time this is not a problem, but I suspect many archivists think of themselves as creating one identifier, and don’t consider the distinction between identifying an archive and identifying the description of an archive.

Creating Linked Data Identifiers

Within our Linked Data the identifiers for the descriptions are the Archives Hub URIs, so that we link back into the Hub from the Linked Data. The challenges are around the URIs for the archive collection or sub-collection.

Initially we used the eadid content in the URI for the archive collection identifier, so for example:

eadid reference = GB 1234 JaB
URI = http://data.archiveshub.ac.uk/id/archivalresource/gb1234jab

The ‘GB 1234 JaB’ reference may also be the identifier for the archive, but in this case it is being used as the unique identifier for the description.

The reason we used the eadid is that it is guaranteed to be unique: the Hub requires all eadids to be unique. However, there are two main issues with this:

1) The eadid is the identifier for the description, not the archive collection.

2) Sometimes the agency that maintains the description is not the same as the agency that holds the archive. This is reflected in a different code used in the eadid (to reflect the agency that created the description) and the unitid (to show the repository that holds the archive).

For example:
eadid = GB 133 PPL – the description is maintained by ‘133’, which is the John Rylands Library
unitid = GB 135 PPL – the archive is held at ‘135’, the Methodist Archive (part of the Library, but a separate entity)

In this case, we have to maintain the difference between the eadid and unitid because they are telling us different things.

It was for these reasons that we felt we should create the URI for the archive from the identifier for the archive, which is the content of the ‘unitid’ tag.

URIs for the Archive at Collection Level

At the top level, things can seem relatively straightforward. Examples of unitid:

GB 1086 Skinner
GB 0532 cwlmga

These are neat examples, and we can translate these into nice Linked Data URIs for the archival resource:

http://data.archiveshub.ac.uk/doc/archivalresource/gb1086skinner

http://data.archiveshub.ac.uk/doc/archivalresource/gb0532cwlmga

URIs for the Archive at Lower Levels

For lower levels the unitid entries can be quite complicated, although they should work fine if the country code and repository code are included in some way:

GB 2008 TAS1/1/1
could be:

http://data.archiveshub.ac.uk/doc/archivalresource/gb2008tas1-1-1

But on the Hub, as I have said, we combine the top-level eadid with this lower-level unitid, in order to ensure the reference is unique and that the country code and repository code are incorporated, so

eadid: GB 2008 TaS
unitid: GB 2008 TAS1/1/1
would be:

http://data.archiveshub.ac.uk/doc/archivalresource/gb2008tas-gb2008tas1-1-1

or
eadid: GB 2008 TaS
unitid: TAS1/1/1
would be:

http://data.archiveshub.ac.uk/doc/archivalresource/gb2008tas-tas1-1-1

Problems with Using the Unitid for the Identifier

1) Attributes for the Country and Repository

The unitid does not always contain the country code and repository code; they may be present as attributes, as content, or as both, or not at all. This is problematic for Linked Data, where the identifier for the archive needs to yield a unique URI, and the attributes help to provide this.

2)  Maintaining the Distinction between Identifiers

As set out above, the Hub has recommended that contributors use the same identifier for the finding aid as for the archive it describes (unless the agency is different). The different functions of these two identifiers are still preserved within the Hub, and using the same data for both works perfectly well.

However, we should not actually assume that the eadid’s mainagencycode and the unitid’s repositorycode are the same. The code on the eadid refers to the agency responsible for the description, and the code on the unitid refers to the repository responsible for the archive. They are usually the same, but not necessarily. If we want to make statements about our content, such as ‘this archive is held at this repository’ and ‘this agency was responsible for creating this description’, then the distinction becomes important, as they may be different agencies.

The agencies can also change over time. Take, for example, the Papers of Dr Thomas Coke. These papers might be held at the Methodist Archive (code 135) and described by a minimal EAD document maintained by the Methodist Archive, so eadid/mainagencycode = unitid/repositorycode = 135.

But then at some point maybe John Rylands University Archive revises the description and extends it (maybe the first one was just collection-level and now it is multi-level).  So eadid/mainagencycode = 133 and unitid/repositorycode = 135

The archival stuff hasn’t changed, and it is still held/curated by Methodist Archive, so the URIs of the archival stuff shouldn’t change, even where the description is attributed to a different repository. This means that we shouldn’t rely on eadid content/attributes in constructing URIs for the holding repository.

This whole situation can be complicated by the fact that sometimes the unitid does not contain the repository code at all, so the only code we have is from the eadid, and we have to assume they are the same agency.
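
In other words, when deciding which agency to use for a ‘held at’ statement, the unitid’s repositorycode should take precedence, with the eadid’s mainagencycode only as a fallback. A rough sketch of that fallback (illustrative only, not our production transform):

import xml.etree.ElementTree as ET

def holding_repository_code(eadid_xml, unitid_xml):
    """Prefer the unitid's repositorycode; fall back to the eadid's mainagencycode."""
    unitid = ET.fromstring(unitid_xml)
    eadid = ET.fromstring(eadid_xml)
    return unitid.get("repositorycode") or eadid.get("mainagencycode") or ""

print(holding_repository_code(
    '<eadid countrycode="GB" mainagencycode="133" identifier="PPL">GB 133 PPL</eadid>',
    '<unitid countrycode="gb" repositorycode="135">GB 135 PPL</unitid>',
))
# prints: 135 (the holding repository, not the agency maintaining the description)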

3) Inconsistencies in the Data

As stated above, the unitid does not always contain attributes, and so entries vary quite widely. This is mainly a result of data coming into the Hub from many different sources over a period of time. Many descriptions were created in other systems, and it is always a challenge to move data between systems and end up with something consistent and fit for purpose; others were originally created in something like Word, so issues such as unique identifiers for URIs were simply not in the game plan at the time. In general, the eadid entries are more consistent and easier to work with than the unitid entries for multi-level descriptions.

4) Unitid may not be Unique

We have hit problems with the unitid not being unique across the Hub, mainly for lower-level descriptions, and this is the most significant problem. For the Hub, the only identifier that has to be unique is the identifier for the description, the eadid, because the Hub essentially works with the description.

Process to Identify Duplicate Unitids

Pete created a spreadsheet analysing unitid content and attributes using XSLT – a nifty piece of work that allowed me to see exactly where the duplicate identifiers are. We found that in general duplicates exist both in the raw EAD and in the identifiers created by the Archives Hub. But the transformation process, whereby the Hub converts to lower case and uses ‘-’ instead of ‘/’, can create a duplicate where one did not exist in the raw EAD, such as for StoneR and Stoner (both become ‘stoner’), or for MS/1 and MS-1 (both become ‘ms-1’).

With this spreadsheet I was able to order the entries by ‘clashes’ within the identifiers and then decide how to tackle them.
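
Pete’s analysis was done in XSLT, but the effect is easy to illustrate. A toy Python sketch (the country and repository codes here are made up) showing how the normalisation itself can create clashes:

from collections import defaultdict

def normalise(ref):
    return ref.replace(" ", "").lower().replace("/", "-")

references = ["GB 123 StoneR", "GB 123 Stoner", "GB 456 MS/1", "GB 456 MS-1"]

clashes = defaultdict(list)
for ref in references:
    clashes[normalise(ref)].append(ref)

for normalised, originals in clashes.items():
    if len(originals) > 1:
        print(normalised, "<-", originals)
# gb123stoner <- ['GB 123 StoneR', 'GB 123 Stoner']
# gb456ms-1 <- ['GB 456 MS/1', 'GB 456 MS-1']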

Case Distinction

The ‘Stoner/StoneR’ problem is tricky. One way to get round it would be to go back to URIs that are case sensitive, but this does seem like a retrograde step. Another option would be to work with the contributors to avoid using case as a means to distinguish references, and I think this is what we will do. I think they will be happy to work with us on this, so that we can stick to lower-case URIs but avoid duplication.

Distinguishing Repository Code and Reference

There is an issue with identifiers where the local reference starts with a number, e.g. http://archiveshub.ac.uk/data/gb1067esw. It looks as though the repository code is ‘1067’ but actually it is ‘106’ and the reference is ‘7ESW’. This could potentially be a problem if the repository codes ‘106’ and ‘1067’ were both used on the Hub.

We are starting to think about converting to URIs with hyphens to show the three parts of the reference, e.g. http://archiveshub.ac.uk/data/gb-106-7esw.
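
A small illustration of why the hyphens help (the hyphenated form is only something we are considering, not a confirmed change):

def concatenated(country, repo, ref):
    return (country + repo + ref).lower()

def hyphenated(country, repo, ref):
    return "-".join((country, repo, ref)).lower()

print(concatenated("GB", "106", "7ESW"))   # gb1067esw
print(concatenated("GB", "1067", "ESW"))   # gb1067esw  - identical, so ambiguous
print(hyphenated("GB", "106", "7ESW"))     # gb-106-7esw
print(hyphenated("GB", "1067", "ESW"))     # gb-1067-esw - now distinct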

Former Reference

We had to deal with filtering out the ‘former reference’, which many contributors include as a second unitid entry, particularly where the descriptions come from other systems. This is usually a relatively straightforward process, although it is slightly complicated by the fact that in a few instances the attribute value ‘former reference’ isn’t used, e.g.:

<unitid label="former reference">http://archiveshub.ac.uk/data/gb123abc</unitid>

<unitid>Former reference: http://archiveshub.ac.uk/data/gb123abc</unitid>

The latter example can create problems for us. However, as with so many things, there is one further issue here: some contributors actually want the former reference (what they might call ‘alternative reference’) to be the current reference – in this case we are stumped! The only option would be to edit the Hub descriptions prior to creating the Linked Data.
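
For what it is worth, a rough sketch of the filtering (illustrative only – the reference values are invented, and it only catches the two forms shown above):

import xml.etree.ElementTree as ET

def is_former_reference(unitid_xml):
    """True if a unitid is labelled, or prefixed, as a former reference."""
    el = ET.fromstring(unitid_xml)
    label = (el.get("label") or "").strip().lower()
    content = (el.text or "").strip().lower()
    return label == "former reference" or content.startswith("former reference")

print(is_former_reference('<unitid label="former reference">ABC/1</unitid>'))  # True
print(is_former_reference('<unitid>Former reference: ABC/1</unitid>'))         # True
print(is_former_reference('<unitid>GB 123 ABC/1</unitid>'))                    # False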

Changes to References

In the end, even once we have addressed all of the other issues with identifiers, there is still the risk that an archive repository will choose to change its reference. This is not common, but it does happen. We could probably find a way to ‘archive’ the old reference and have some kind of link between the two. The horror scenario would be if the repository then used the old reference, which to them may now be redundant, for a new accession.

Reasons for Duplication

The unitid is usually duplicated due to human error, rather than as part of the system of cataloguing. It isn’t surprising when you consider the level of detail and length of some descriptions, and that they are created with various software applications, some of which don’t really help with creating good identifiers. You find entries with references such as: JAB/1/2(i)/2/2, JAB/1/2(i)/2/3, JAB/1/2(i)/3, JAB/1/2(i)/5. Once you identify where the mistake is, you can simply correct it.

In one interesting example for a large collection, the duplicate identifier was purposely used, because the higher level entry described a person and the entry below that described the ‘stuff’ about them. So, something like ABC/JB for an entry about John Bunn in the ABC collection, and ABC/JB for a further entry for a file of stuff about John Bunn. You can see how in the display this works OK, although it means each entry has the same URI, but in terms of Linked Data it is a problem.

Conclusion

Our plan at present is, as far as possible, to correct the data, rather than try to work around it. This means some ‘persistent’ URIs will change, but it seems worth the end result of ensuring unique identifiers. We will have to run the analysis on all of our data in order to pick up any issues that need addressing. This has the advantage that we can also pick up other problems with the identifiers, such as the odd incorrect repository code, or attributes absent from eadid entries.

It has been crucial for Pete, who has done the processing, to really understand how URIs are constructed. In doing this it has become apparent how skilfully our system has been developed by John Harrison, our developer, to cope with many variations, as you have to with a large aggregator. It would have been extremely difficult to ensure a level of rigour in the data initially, when we were taking in such a diversity of content and focussing on building up the service. In addition, we could not have planned for ‘cool URIs’ or, of course, Linked Data.

For Linked Data, rigour is important because you are drawing out so many different entities within the world you are describing, and they all need unique and dereferenceable http URIs. The question is, do you put the effort into introducing more rigour into your source data – is it worth the investment? It is certainly often time-consuming and can be quite a difficult process with so many variables to think about. I think that it generally is worth doing, because more consistent data with properly constructed identifiers must be a good thing, not just for Linked Data but for the whole open and interoperable agenda.


Archives & Linked Data Meetup

With the various Linked Data projects happening around archives, the time seemed right to get together and share what we have done, what we think, and what we see as the challenges around this work. The Locah project team were particularly keen to find out what the AIM25 team had been up to with their Open Metadata Pathway and Step Change projects, but it seemed well worth broadening the invitation, so I put one out to other projects and through the archives email list. In the end we had several projects represented:

Locah / Linking Lives: a project to create Linked Data for EAD archive descriptions on the Archives Hub. We have provided a stylesheet that converts EAD to RDF/XML, as well as a Sparql endpoint and Linked Data views of our data. We did a significant amount of work around data modelling, the use of RDF vocabularies and creating external links, and we blogged in detail about the processes and issues involved in creating Linked Data from complex hierarchical archive descriptions. We are now working on an interface to show how the Linked Data can be used to bring different resources together.

Open Metadata Pathway / Step Change: work around the use of OpenCalais and the UKAT thesaurus (subjects and names) to extract entities from data and enable URIs to be created. A tool is being developed to allow archivists to do this at the time of cataloguing, and the project is working with Axiell CALM to embed this into the CALM software and display it via CALMView.

SALDA: A project to output the Mass Observation archive as Linked Data and enhance the data. This built on the Locah work.

Bricolage: will publish catalogue metadata as Linked Open Data for two of its most significant collections: the Penguin Archive, a comprehensive collection of the publisher’s papers and books; and the Geology Museum. It intends to build on previous work done by Locah and SALDA.

Trenches to Triples: will provide Linked Data markup to both collection level descriptions and item level catalogue entries relating to the First World War from the Liddell Hart Centre for Military Archives. It will also provide a demonstrator for using Linked Data to make appropriate connections between image databases, Serving Soldier, and detailed catalogues.

The majority of those attending represented these projects. There was also representation from the DEEP project to digitise English place names and make them available as structured data. Other attendees represented the museum sector and The National Archives. In addition, we brought developers together with archivists and managers, and I think we managed to strike a good balance in our discussions so that they were of benefit to everyone.

In the morning we shared information about our projects. This gave us a chance to ask questions and get a clearer understanding of what people have been doing in this space. A number of issues were presented and discussed.

Extracting concepts from data

We were given a demonstration of the prototype cataloguing interface developed by the OMP project and now being developed under Step Change. It uses OpenCalais to extract concepts from archive descriptions, which tend to be quite document-centric, and contain large chunks of text, particularly in the biographical history and the scope and content sections. The idea is to provide URIs for these concepts, so, for example, OpenCalais highlights a name within a paragraph of text, such as ‘Architectural Association’, and you can then confirm that this is an organisation and it is relevant as an index term so that it is marked up appropriately. The tool is being tested with archivists because ease of use is going to be key to its success. We discussed the limitations of the OpenCalais vocabulary – it does really clever data analysis, but it isn’t geared up for historical data sources. UKAT is much broader and more suitable for archive descriptions – it would be good to integrate this vocabulary into OpenCalais.

Solving the challenges of archive descriptions

We discussed some of the challenges that Locah has faced with processing multi-level archive descriptions, challenges such as: duplicate identifiers for different resources (especially within the same description – more to come on this issue on the Linking Lives blog); creating URIs for data such as extent of archive (where you may have ‘10 boxes’, but you may also have ‘2 boxes, 10 large photographs and a reel of film’); inheritance of data from the top level down through the levels of a description (which is problematic within Linked Data); and matching names on the Archives Hub to names on VIAF (which we’ve had reasonable success in doing, though in archives names can be quite problematic, such as ‘John Davis, fl. 1880-1890’).

Working with one dataset versus working with a large aggregation

We thought about the comparison between creating a Linked Data output for the Archives Hub, which aggregates data from hundreds of archives, and creating it for just the Mass Observation Archive. Whilst the scale you are working with is appealing with a large aggregator (the potential to create Linked Data for all these repositories), working with a discrete collection gives you more control to interpret things within the data (for example, the date may always appear in a certain place, so you can confidently mark up the date as a date and give it a URI).

This led us into some discussion around the way that creating Linked Data can really highlight problems within the data source, and it may provide impetus to address these problems, thus improving the source data by making it more consistent or clarifying the meaning of elements within the data.

Integrating Linked Data into the Workflow

The Step Change project is particularly focussed on the challenge of making this kind of semantic markup easy to achieve. It has to be well-received by cataloguers. Work is being undertaken at Cumbria Record Office to test the tool out and provide feedback. We discussed the importance of major players such as Axiell CALM embracing this kind of approach, enabling Linked Data to be created from CALM descriptions. This is not yet happening, but the Step Change project is working with CALM and so it is a good starting point. We also discussed the need for the CALM user group to think about whether they want this kind of Linked Data output from their software provider (it needs to be demand-led).

The ‘Same As’ Issue

We touched on the issues around trusted data a number of times. The SALDA project found that creating ‘same-as’ links was probably the most challenging part of the project. We agreed that we must be aware of the importance of archive descriptions being trusted sources, and that there has been a tendency for some data providers to use ‘same-as’ links too promiscuously. In a Linked Data context this is problematic, as you are then asserting that all of the statements made by the data source you are linking to through this relationship are true. It raises the issue of manual matching as a means to be sure your links are semantically correct, but doing this is time-consuming, so it can only be carried out in a minority of cases.
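
As a sketch of the cautious approach we discussed (using rdflib; the URIs below are placeholders, not real Hub or VIAF identifiers), the idea is simply to assert owl:sameAs only for links that have been confirmed:

from rdflib import Graph, URIRef
from rdflib.namespace import OWL

# Hypothetical candidate matches: (our URI, external URI, manually confirmed?)
candidates = [
    ("http://data.archiveshub.ac.uk/id/person/example-person",
     "http://viaf.org/viaf/00000000", True),
    ("http://data.archiveshub.ac.uk/id/person/another-person",
     "http://viaf.org/viaf/11111111", False),  # unconfirmed, so left out
]

g = Graph()
for ours, theirs, confirmed in candidates:
    if confirmed:
        g.add((URIRef(ours), OWL.sameAs, URIRef(theirs)))

print(len(g))  # 1 - only the confirmed link is asserted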

* * *

In the afternoon we had two sessions, (i) techniques and tools, and (ii) opportunities and barriers. A brief summary of some of the points that were made during the discussions:

Benefits of Linked Data

  • The principle of generic APIs – a standard way to provide access to data – it could replace the myriad of bespoke APIs now available.
  • Dataset integration – bringing data sources together.
  • The precision of information retrieval and giving researchers more potential to ask specific questions of data sources. For example, a researcher may be able to retrieve information around a specific event.
  • Embedding the expertise of practitioners was seen as something that should be a benefit, i.e. we should ensure that this happens.
  • It encourages cross-domain working.
  • It enables people to create their own interfaces and tools to utilise data sources.
  • It encourages the creation of narratives through diverse data sources.
  • It is very much an ‘anti-silo’ approach to data.

Challenges

  • Expertise required.
  • Need to clearly show the extra value it brings (e.g. above what Google offers).
  • Need clearer understanding of end-user benefits.
  • Sustainability and persistence (we talked about the idea of a ‘URI checker’ and using caching to help with this).
  • Possible overload caused by large-scale searches or ‘bad’ Sparql queries.
  • Licensing, including restrictions on external data that you might want to link to within your data.
  • Choice of so many vocabularies.
  • Likelihood of not following the same kinds of practice, thus impeding the linking possibilities between datasets.

Conclusions

The group felt that a recommended approach for archive descriptions would be really useful, to facilitate others outputting Linked Data and ensure we do get the benefits that Linked Data potentially offers.

We talked about a generic stylesheet. The community has already benefitted from the data model and stylesheet developed by the Locah project, with AIM25’s OMP project and SALDA both using it and Bricolage looking at it for their Linked Data work. But there are still issues around the diverse nature of the data: a stylesheet to transform EAD descriptions to RDF/XML may be a great start for many projects, but modifications are almost inevitable, and expertise would be required for this.

We did decide that a possible way forward would be what one attendee called a ‘lego approach’, where we think in terms of building blocks towards Linked Data. The idea would be to work on discrete parts of a data model and to recommend best practice. For example, one area would be the relationship between the resource and the index terms or access points. Another would be the relationship between the resource and the holding institution.

This approach should be cross-domain, bringing archives together with museums and libraries. We could look at parts of the model in turn and decide what the relationships are, whether they are consistent across the domains and which vocabularies would be appropriate. The idea would be to end up with a number of ‘RDF fragments’ that people could use, but with the flexibility to extend the model to meet their requirements.

We are hoping to discuss this proposal further and think about what would be required in order to achieve this kind of co-ordinated approach. Obviously we would need to get buy-in across the three domains: our meeting represented archives, but this approach would require a very collaborative effort. However, the advantages are clear, and it does seem to strike a balance between the challenges of a completely interoperable solution and the disadvantages of each domain working out different models and using different vocabularies.

