URIs, identity, aliases & “consolidation”

Jane has written a few posts recently on our efforts to improve the stability of URIs used for pages about archival resources on the “live” Archives Hub service, and as far as possible we’ll be trying to reflect the changes made there in the URIs we use in the Linked Archives Hub RDF data. Much of that work has led to a review of the conventions used in the source EAD XML data and a concerted effort to “cleanse” or enhance that data to improve its coherence and consistency.

In this post, I’ll focus on some issues around the URIs used to identify Persons in the Linked Archives Hub data. It’s something I’ve been trying to write on and off over a period of several weeks, and a combination of some work I’ve been doing for the Bricolage project, and some subsequent conversations, have prompted me to try to knock my rather rambling drafts into shape.

The Data Transformation Process

It’s probably worth taking a step back and emphasising that the process by which the Linked Archives Hub RDF data is generated is currently a relatively simple one:

  • EAD XML documents are transformed into RDF/XML using an XSLT transform. This process is performed on a “document-by-document” basis, i.e. it has as input a single EAD document and an XSLT stylesheet and outputs a single RDF/XML document; the process does not have any “knowledge” of the other EAD documents within the dataset to be transformed.
  • The output from the transform is uploaded to a triple store.
  • Some supplementary data is uploaded alongside the data derived from the EAD documents. This data is the product of various processes: some is “hand-crafted”; some is imported from external sources; some is the result of processes run over the EAD-derived data; some is the result of “lookups” against external datasets – but for the purposes of this discussion, the key point to note is that it is “added to” the EAD-derived data, and that EAD-derived data itself is not changed.
  • That data is served as “Linked Data” “bounded descriptions”.

The URIs used to identify persons in the Linked Archives Hub dataset have their origins in the names of persons occurring in the Archives Hub EAD XML documents. Within those documents, person names occur in two contexts (or at least the EAD-to-RDF transformation process currently takes into account occurrences of names in two contexts). I’ll describe here how the conversion process handles this data, what RDF data is generated and then look at some of the issues this raises.

The examples I’ll use are all from the small subset of EAD documents included in the current Linked Archives Hub data. I’ve picked the case of Beatrice Webb, which illustrates several of the variations which can occur and the issues which arise.

Personal names as index terms

The first context is that of personal names added to the description by the cataloguer as “index terms” on the basis that they may be useful for the purposes of retrieval/search/browse. In the Hub EAD documents, they occur in XML structures like the following, using the EAD controlaccess element. In its simplest form, this looks like:

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
    </did>
    ...
    <controlaccess>
      <persname source="nra">(name)</persname>
    </controlaccess>
  </archdesc>
</ead>

In some (but not all) of the Hub EAD documents, a convention employing the emph element and emph/@altrender attribute is used to capture the distinction between the component parts of a name constructed according to a name rules system – this is something local/“proprietary” to the Hub application (and really a “redefinition” of the EAD tag semantics): a “standard EAD” application would not interpret the markup in this way.

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
    </did>
    ...
    <controlaccess>
      <persname source="nra">
        <emph altrender="surname">Webb</emph>
        <emph altrender="forename">Martha Beatrice</emph>
        <emph altrender="dates">1858-1943</emph>
        <emph altrender="epithet">Social Reformer</emph>
      </persname>
    </controlaccess>
  </archdesc>
</ead>
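
As a rough illustration of how this convention can be read, the sketch below pulls the labelled parts out of a persname element. It is written in Python purely for illustration (the Hub transform itself is an XSLT stylesheet), and the function name name_parts is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <persname> element from the example above, on its own.
PERSNAME = """
<persname source="nra">
  <emph altrender="surname">Webb</emph>
  <emph altrender="forename">Martha Beatrice</emph>
  <emph altrender="dates">1858-1943</emph>
  <emph altrender="epithet">Social Reformer</emph>
</persname>
"""

def name_parts(persname_xml):
    """Return (authority, [(altrender, text), ...]) for one <persname> element.

    Hypothetical helper: the real transform is XSLT, not Python.
    """
    elem = ET.fromstring(persname_xml)
    # The name comes either from an authority file (@source) or name rules (@rules).
    authority = elem.get("source") or elem.get("rules")
    parts = [(e.get("altrender"), (e.text or "").strip()) for e in elem.findall("emph")]
    return authority, parts

authority, parts = name_parts(PERSNAME)
print(authority)
for label, text in parts:
    print(label, "=", text)
```

A "standard EAD" consumer would see only emphasised text here; the altrender labels carry meaning only under the Hub's local convention.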

Within the subset of the Hub EAD data currently transformed into RDF, this same “index term” – same XML fragment – is used in three different EAD XML documents:

In this example, the persname/@source attribute is used to capture the name of a “name authority file” from which the name is drawn, the “nra” value here indicating the use of the National Register of Archives (NRA). The NRA itself is not currently available as Linked Data, so does not provide URIs for the entities described. The NRA record for Beatrice Webb is http://www.nationalarchives.gov.uk/nra/searches/subjectView.asp?ID=P29999. In fact, the actual form of the name used in the authority record (“Webb, Martha Beatrice (1858-1943) nee Potter, Social Reformer”) does appear to differ slightly from that used in these three EAD documents (i.e. it includes “nee Potter”).

As I discussed on the LOCAH project blog, in our mapping of the EAD data into an RDF representation, from this XML structure we generate two resources to try to capture the distinction between the person and the “conceptualisation” of that person reflected in the authority file entry or the use of the name rules. The two resources have distinct URIs and are linked using the foaf:focus property.

The patterns for the URIs for both the concept and the person are similar, and based on a combination of:

  • the name of the authority file or (see below) of the name rules
  • a “slug” derived from the name itself (including life dates, titles, epithets etc)
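
The pattern can be sketched in a few lines of Python. This is an approximation inferred from the URIs quoted in this post, not the stylesheet's actual logic: the assumption is that the slug is the lower-cased name with everything except ASCII letters, digits and hyphens removed. The function names are hypothetical.

```python
import re

BASE = "http://data.archiveshub.ac.uk/id"

def slug(name_parts):
    """Assumed rule: join the parts, lower-case, keep only a-z, 0-9 and '-'."""
    joined = "".join(name_parts).lower()
    return re.sub(r"[^a-z0-9-]", "", joined)

def person_uri(authority, name_parts):
    # authority is the authority file name or the name rules, e.g. "nra" or "ncarules"
    return f"{BASE}/person/{authority}/{slug(name_parts)}"

def concept_uri(authority, name_parts):
    return f"{BASE}/concept/person/{authority}/{slug(name_parts)}"

parts = ["Webb", "Martha Beatrice", "1858-1943", "Social Reformer"]
print(person_uri("nra", parts))
print(concept_uri("nra", parts))
```

Note that under this assumed rule any accented character (for example in “née”) would simply be dropped; how the real transform treats such characters is not covered here.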

So for the cases above the Person URI generated is:

  • http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer

The RDF generated for the concept and the person looks like this (in Turtle):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://data.archiveshub.ac.uk/id/concept/person/nra/webbmarthabeatrice1858-1943socialreformer>
  a skos:Concept ;
  rdfs:label "Webb, Martha Beatrice, 1858-1943, social reformer" ;
  foaf:focus
    <http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer> .

<http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer>
  a foaf:Person ;
  rdfs:label "Webb, Martha Beatrice, 1858-1943, social reformer" ;
  foaf:name "Martha Beatrice Webb" ;
  foaf:familyName "Webb" ;
  foaf:givenName "Martha Beatrice" .

In other cases, the persname/@source attribute is not present, but instead the persname/@rules attribute is used to provide the name of a set of “name rules” under which the name is constructed. The example below refers to the use of “ncarules”, i.e. the National Council of Archives’ Rules for the Construction of Personal, Place and Corporate Names.

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
    </did>
    ...
    <controlaccess>
      <persname rules="ncarules">
        <emph altrender="a">Webb</emph>,
        <emph altrender="forename">Martha Beatrice</emph>
        <emph altrender="dates">1858-1943</emph>
        <emph altrender="other">nee Potter</emph>
        <emph altrender="epithet">social reformer and historian</emph>
      </persname>
    </controlaccess>
  </archdesc>
</ead>

This form is present in seven EAD documents:

and is mapped by the transform to the URI

  • http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943neepottersocialreformerandhistorian

A second form of the name, also constructed using NCA Rules, but with a variation in the epithet and the “nee Potter” omitted, is also used:

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
    </did>
    ...
    <controlaccess>
      <persname rules="ncarules">
        <emph altrender="a">Webb</emph>,
        <emph altrender="forename">Martha Beatrice</emph>. (
        <emph altrender="y">1858-1943</emph>)
        <emph altrender="epithet">social reformer</emph>
      </persname>
    </controlaccess>
  </archdesc>
</ead>

This appears in one EAD document:

and is mapped to the URI:

  • http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943socialreformer

There are a few points worth noting here:

First, and most obviously (and this was the point that initially prompted me to start writing this post), the fact that different forms of name can – quite legitimately, within the constraints of the EAD format and the Hub data entry guidelines – be used as index terms to refer to the same person across the dataset means that we end up generating through our transform process – and publishing/exposing to the Web in our data – multiple URIs for the same person. From the cases above, we have three distinct “URI aliases” for Beatrice Webb:

  • http://data.archiveshub.ac.uk/id/person/nra/webbmarthabeatrice1858-1943socialreformer
  • http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943socialreformer
  • http://data.archiveshub.ac.uk/id/person/ncarules/webbmarthabeatrice1858-1943neepottersocialreformerandhistorian

Second, the use of the name to construct the URI is not a guarantee of avoiding URI ambiguity (i.e. of having a single URI used to refer to what are in fact two different things). In archival description data it is quite common to encounter names without complete life dates or epithets, and in a dataset the size of the Hub, it is quite possible that there are two occurrences of an index term like “Smith, John, 1945-, engineer”, both constructed using the same “name rules”, which are intended as references to two distinct individuals but would be mapped to the same URI.

Third, the “repeatability” of the transformation process over time is not guaranteed. If any of the name components changes in the EAD document (e.g. a previously unknown date of death is added, or an “epithet” is added or removed), then the subsequent re-transformation of the data will generate a different URI from that generated from the previous process using the initial form of the name. (Is “Scott, James, 1950-2012, biologist” in this version the same person who was referred to as “Scott, James, 1950-, scientist” in a previous version?)

Fourth, for both URIs, that of the Concept and of the Person, the URI includes the name of the “authority file” or name rule system.

I’m willing to concede that for the Person case this may be “overkill”. I think I chose this because I was wary of conflating what were in reality two different persons based on matches in their names. So, on this basis, it should not be automatically assumed that the same form of name in two different authority files refers to the same person, at least not without some human verification – though having said that, if there is a match on “life dates” and “epithets”, then it seems highly probable that they do.

Similarly with the name rule systems case. The situation here is probably even more complex, as in archival description data it is quite common to encounter names without complete life dates or epithets. I also wondered if it was theoretically possible that under two different name rule systems, different surname/forename ordering rules might result in two quite different names mapping to the same string in the URI. e.g. forename = James and surname = Scott under a surname first rule would result in “scottjames….” and forename = Scott and surname = James under a forename first rule would also result in “scottjames….”.

So, in short, retaining the name of the name rules or the authority file as part of the Person URI was part of an attempt to avoid accidentally conflating what may be two different persons, i.e. to reduce instances of the second problem above, though this very tactic potentially contributes to the first one!
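
The repeatability and ambiguity points can be made concrete with a small Python sketch. It assumes the same approximate slug rule as the URIs shown earlier (lower-case, keep only letters, digits and hyphens); the two ordering functions are hypothetical stand-ins for two different name rule systems.

```python
import re

def slug(text):
    """Assumed slug rule: lower-case, keep only a-z, 0-9 and '-'."""
    return re.sub(r"[^a-z0-9-]", "", text.lower())

# Repeatability: editing the name in the EAD document changes the slug,
# and therefore the URI, even though the person is (probably) the same.
before = slug("Scott, James, 1950-, scientist")
after = slug("Scott, James, 1950-2012, biologist")
print(before, after, before == after)

# Ambiguity: under different (hypothetical) ordering rules, two different
# people can end up with the same slug.
def surname_first(surname, forename):
    return slug(surname + forename)

def forename_first(surname, forename):
    return slug(forename + surname)

person_a = surname_first(surname="Scott", forename="James")
person_b = forename_first(surname="James", forename="Scott")
print(person_a, person_b, person_a == person_b)
```

Including the name of the rules in the URI path keeps person_a and person_b apart, at the cost of producing more aliases for people who are in fact the same.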

Personal names as names of the creators/“originators” of archival resources

The second context in which personal names are found is as the names of agents responsible for the creation or bringing together of the resources described. In the Hub EAD documents, they occur in XML structures like:

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
      <origination>(name)</origination>
    </did>
    ...
  </archdesc>
</ead>

In the Hub EAD data, there is no guarantee that the data indicates whether the name is that of a person or an organisation. Although the EAD schema does support the use of the <persname> and <corpname> elements within the <origination> element, and indeed they are present in some Hub data, the Hub data entry tool does not provide this distinction.

While cataloguers are encouraged to provide the name of the originator also as an index term, this guideline is not always followed.

Furthermore, the Hub data entry guidelines for this element encourage the use of “the commonly used form of name”, so it may be that the form of name used here is different from that used as an “index term”, which creates potential complexity in trying to “reconcile” the two.

Beatrice Webb appears as the creator/originator of five collections:

using one of the following two XML structures:

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
      <origination encodinganalog="3.2.1">
      Webb, (Martha) Beatrice, 1858-1943, wife of 1st Baron Passfield, social reformer and historian
      </origination>
    </did>
    ...
  </archdesc>
</ead>

<ead>
  <eadheader>
    <eadid></eadid>
    ...
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid></unitid>
      <origination encodinganalog="3.2.1">
      Webb, Martha Beatrice, 1858-1943, wife of 1st Baron Passfield, social reformer and historian
      </origination>
    </did>
    ...
  </archdesc>
</ead>

The name-to-URI mapping algorithm discards the parentheses so both cases map to a single URI:

  • http://data.archiveshub.ac.uk/id/agent/gb97/webbmarthabeatrice1858-1943wifeof1stbaronpassfieldsocialreformerandhistorian

Post-transform processing

After this EAD-derived data is uploaded to the triple store, some further processes are applied:

  • a “lookup” process which extracts information about “persons” in the Hub data and searches for candidate matches in the VIAF dataset
  • a process which seeks candidate matches within the Hub dataset between “agents” (generated from the creator/origination context) and “persons” (generated from the index terms context)

The result of this is the addition of a set of triples with owl:sameAs predicates to indicate that the various data.archiveshub.ac.uk URIs (and the VIAF URI) identify the same person.

One of the problems with this approach is that an application consuming the data still has to be prepared to work with these multiple URI aliases, and particularly with SPARQL, this can be quite cumbersome: given URI X denoting a person, to find all the data we hold about the person, an application has to search for patterns involving not just that known URI X but also any URI Y, where URI Y sameAs URI X.
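
The extra work this puts on a consumer can be sketched in Python over a plain list of triples. The person and agent URIs are abbreviated with a made-up hub: prefix, and the ex: resources and the ex:originator predicate are hypothetical placeholders, not terms from the Hub data.

```python
SAMEAS = "owl:sameAs"

NRA = "hub:person/nra/webbmarthabeatrice1858-1943socialreformer"
NCA = "hub:person/ncarules/webbmarthabeatrice1858-1943socialreformer"
AGENT = "hub:agent/gb97/webbmarthabeatrice1858-1943wifeof1stbaronpassfieldsocialreformerandhistorian"

triples = [
    ("ex:concept1", "foaf:focus", NRA),
    ("ex:concept2", "foaf:focus", NCA),
    ("ex:archive1", "ex:originator", AGENT),  # hypothetical predicate
    (NRA, SAMEAS, NCA),
    (AGENT, SAMEAS, NCA),
]

def aliases(uri, triples):
    """All URIs connected to `uri` by owl:sameAs, followed in both directions."""
    seen, todo = {uri}, [uri]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if p != SAMEAS:
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in seen:
                    seen.add(b)
                    todo.append(b)
    return seen

def describe(uri, triples):
    """Every non-sameAs triple that mentions `uri` or any of its aliases."""
    names = aliases(uri, triples)
    return [t for t in triples if t[1] != SAMEAS and (t[0] in names or t[2] in names)]

naive = [t for t in triples if t[1] != SAMEAS and NRA in (t[0], t[2])]
print(len(naive), len(describe(NRA, triples)))
```

A query on the known URI alone finds one statement; only by walking the sameAs links does the application find all three.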

Materialising inferences?

As a possible further measure to mitigate these difficulties, we might perhaps take the approach of further “materialising inferences” based on these owl:sameAs predicates, i.e. explicitly adding to the data the further set of triples which can be inferred from those triples. While this would facilitate querying, it increases the size of the dataset and also (from a “provenance” perspective) adds to the complexity of managing how we distinguish the different sources of data (e.g. which triples had their origin in the transformation of the source EAD documents and which were added by subsequent processes).
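
A minimal Python sketch of what materialising could involve is given below. It only copies ordinary statements across aliases and ignores the rest of the OWL semantics of owl:sameAs; the shortened hub: and ex: names are hypothetical.

```python
from itertools import product

SAMEAS = "owl:sameAs"

def alias_sets(triples):
    """Map each aliased URI to the full set of URIs it is owl:sameAs (itself included)."""
    groups = {}
    for s, p, o in triples:
        if p == SAMEAS:
            merged = groups.get(s, {s}) | groups.get(o, {o})
            for uri in merged:
                groups[uri] = merged
    return groups

def materialise(triples):
    """Add a copy of every non-sameAs triple for each alias of its subject and object."""
    groups = alias_sets(triples)
    inferred = set(triples)
    for s, p, o in triples:
        if p == SAMEAS:
            continue
        for s2, o2 in product(groups.get(s, {s}), groups.get(o, {o})):
            inferred.add((s2, p, o2))
    return inferred

triples = [
    ("hub:nra/webb", "rdfs:label", "Webb, Martha Beatrice, 1858-1943, social reformer"),
    ("ex:archive1", "ex:originator", "hub:agent/webb"),  # hypothetical predicate
    ("hub:nra/webb", SAMEAS, "hub:agent/webb"),
]

expanded = materialise(triples)
print(len(triples), "->", len(expanded))
```

Even in this tiny case the dataset grows, and nothing in the added triples records that they were inferred rather than derived from an EAD document, which is the provenance concern raised above.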

Consolidation and “Annotation”

I’m coming to the conclusion that while our current process is “OK-ish” as a first stab at generating an RDF representation, the “repeatability” issue (change in name resulting in change of URI) is a problem, and the presence of these multiple URI aliases in the published data is, while not strictly “wrong”, at best rather “sub-optimal” for consumers of the data.

The “repeatability” problem is the consequence of our basing the “slug” in the Person URI pattern on data attributes that can change over time. At the time the transform is applied, the only data that is available is the name (and the associated attributes), so I’m not sure I have a good answer to this. One approach would be to see the transformation stage as only the first part of a larger process, to keep track of the URIs generated over time, and build in a stage of processing to reconcile the URI generated from “Scott, James, 1950-2012, Sir, biologist” this week with the URI generated from “Scott, James, 1950-, scientist” in the previous version of the document six months ago. This perhaps then becomes simply a special case of the second problem, of dealing with multiple URIs for a single entity.

On the second problem, given the nature of our input data, it may well be a necessary part of the process that the initial transformation stage does result in multiple URIs. But once we’ve applied the post-transform processing to “reconcile” these references, rather than publishing a set of sameAs triples, maybe we should take a step further and consider “consolidating” our data to use a single URI for the person?

So e.g., if our post-transform processing tells us that, as I describe above, we have four distinct data.archiveshub.ac.uk URIs which all refer to the person Beatrice Webb, should we “distill down” to one of those four, and replace the occurrences of the other URIs in the data?
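
One possible shape for such a consolidation step is sketched below in Python. The four Hub URIs are the ones discussed in this post; the ex: resources, the ex:originator predicate and the rule for choosing the surviving URI (prefer a /person/ URI, then the shortest) are assumptions made for the sake of the example.

```python
SAMEAS = "owl:sameAs"
BASE = "http://data.archiveshub.ac.uk/id/"

NRA = BASE + "person/nra/webbmarthabeatrice1858-1943socialreformer"
NCA1 = BASE + "person/ncarules/webbmarthabeatrice1858-1943socialreformer"
NCA2 = BASE + "person/ncarules/webbmarthabeatrice1858-1943neepottersocialreformerandhistorian"
AGENT = BASE + "agent/gb97/webbmarthabeatrice1858-1943wifeof1stbaronpassfieldsocialreformerandhistorian"

def prefer_person(group):
    # Assumed policy: prefer a /person/ URI, then the shortest, then alphabetical order.
    return min(group, key=lambda u: ("/person/" not in u, len(u), u))

def canonical_map(triples, choose=prefer_person):
    """Map every aliased URI to the one representative chosen for its sameAs group."""
    groups = {}
    for s, p, o in triples:
        if p == SAMEAS:
            merged = groups.get(s, {s}) | groups.get(o, {o})
            for uri in merged:
                groups[uri] = merged
    return {uri: choose(group) for uri, group in groups.items()}

def consolidate(triples, choose=prefer_person):
    """Rewrite all triples to use the chosen URI and drop the sameAs links."""
    canon = canonical_map(triples, choose)
    return {(canon.get(s, s), p, canon.get(o, o)) for s, p, o in triples if p != SAMEAS}

triples = [
    ("ex:concept1", "foaf:focus", NRA),
    ("ex:concept2", "foaf:focus", NCA1),
    ("ex:concept3", "foaf:focus", NCA2),
    ("ex:archive1", "ex:originator", AGENT),  # hypothetical predicate
    (NRA, SAMEAS, NCA1),
    (NCA1, SAMEAS, NCA2),
    (NCA2, SAMEAS, AGENT),
]

consolidated = consolidate(triples)
for triple in sorted(consolidated):
    print(triple)
```

The choice of survivor is the policy question; the rewriting itself is mechanical once the groups of co-referring URIs are known.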

Furthermore, if we know that the content of any name is potentially unstable (i.e. “Scott, James, 1950-, scientist” can be replaced by “Scott, James, 1950-2012, Sir, biologist”), should we be using this as the basis of a URI at all, even in the case where – at this point in time – it is the only name for that person in our dataset? Should we instead manage a mapping to some sort of code and use that to construct a distinct URI again? The challenge is in creating a process/workflow which makes this easy to do, and repeatable if/when data is reprocessed or new data is added.
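
A very small sketch of what “a mapping to some sort of code” might look like follows. Everything here is hypothetical: the class, the code format and the resulting URI pattern are invented for illustration, and a real registry would need to be stored persistently between runs and to record who asserted each alias.

```python
class PersonRegistry:
    """Hypothetical registry mapping forms of name to stable opaque codes."""

    def __init__(self):
        self._codes = {}  # form of name -> opaque code
        self._next = 1

    def code_for(self, name_form):
        """Return the code for a form of name, minting a new one if it is unknown."""
        if name_form not in self._codes:
            self._codes[name_form] = f"p{self._next:06d}"
            self._next += 1
        return self._codes[name_form]

    def add_alias(self, new_form, existing_form):
        """Record that a new form of name refers to an already registered person."""
        self._codes[new_form] = self.code_for(existing_form)

    def uri_for(self, name_form):
        # Invented URI pattern, used only to show the idea of a name-independent slug.
        return f"http://data.archiveshub.ac.uk/id/person/{self.code_for(name_form)}"

registry = PersonRegistry()
old_uri = registry.uri_for("Scott, James, 1950-, scientist")
# Later, after (manual or automated) reconciliation of the revised name:
registry.add_alias("Scott, James, 1950-2012, Sir, biologist", "Scott, James, 1950-, scientist")
new_uri = registry.uri_for("Scott, James, 1950-2012, Sir, biologist")
print(old_uri, new_uri, old_uri == new_uri)
```

The hard part, deciding that the two forms of name denote the same person, still has to happen somewhere; the registry only ensures that the decision, once made, keeps the URI stable.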

A further possibility is suggested by a post by Leigh Dodds which I’ve had at the back of my mind for a while, and which he mentions again in a more recent post.

Leigh argues that as Linked Data providers we tend to publish data using our own URIs, then “reconcile” some of those URIs with some existing published URIs for the same entity created by other providers, and add owl:sameAs assertions to indicate that they are co-references – much as I’ve described here for the Linked Archives Hub case. But an alternative approach in which, instead of publishing our own URIs, we use those existing URIs directly in our own data may well make our data easier to use. Leigh refers to this as “Annotated Data” – in the sense that we are providing new triples using an existing URI. Applying this to our concrete example for Beatrice Webb, if, as I suggest above, it would be a Good Thing to “distill down” our four different URIs for Beatrice to a single URI and substitute that single URI in our data, could we use, say, VIAF’s URI for her for that purpose?

In fact, we already make use of externally-owned URIs directly for the case of languages, where we simply use lexvo.org URIs directly in our data. One motivation for choosing this approach was that it was trivial to construct the lexvo.org URIs in the transform process using the language codes present in the EAD data. Obtaining a VIAF URI for a person, on the other hand, is a rather more complex task involving a search of another dataset and (in some cases, at least), a process of manual verification of candidate matches. But in spite of the difference in the processes of obtaining the URIs, are the two cases so distinct? Particularly if we start to think of our data publication as rather more of a multiple-stage process, I admit I’m less sure than I might have been at one point.
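
To show why the language case is the easy one, here is a Python sketch of building a lexvo.org URI from a language code. It assumes the EAD data supplies a three-letter code that matches a lexvo.org ISO 639-3 identifier; the exact URI pattern used by the Hub transform is not given in this post, so treat the pattern below as an assumption.

```python
LEXVO_BASE = "http://lexvo.org/id/iso639-3/"  # assumed pattern

def language_uri(langcode):
    """Build a lexvo.org URI from a three-letter language code (assumed to be valid)."""
    code = langcode.strip().lower()
    if len(code) != 3 or not code.isalpha():
        raise ValueError(f"not a three-letter language code: {langcode!r}")
    return LEXVO_BASE + code

print(language_uri("eng"))
```

No lookup or verification against another dataset is needed, which is exactly what distinguishes this from obtaining a VIAF URI for a person.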

One factor might be our level of confidence in the stability of any external URIs we use. I’m not sure VIAF has published any formal policy regarding its URIs. But on the other hand, part of the problem that we are grappling with is that of maintaining the stability of our own URIs!

Another factor is that the consequence of adopting the “annotation” approach is that when it comes to dereferencing URIs, we would no longer have a data.archiveshub.ac.uk “Person URI” which we can redirect to a document/graph that we serve. Obviously, the VIAF URI for Beatrice Webb redirects to a document served by VIAF – which would not provide the information that, say, she was the creator/originator of the five archives above or the “foaf:focus” of those concepts associated with the eleven other archives. That information is still present in our dataset, and would be available via SPARQL, and as part of other Linked Data documents we do serve (e.g. the bounded description of the archival resource would include a triple indicating its creator/originator). In principle, we could also, as Leigh suggests in the penultimate paragraph of his post, continue to serve a document providing a bounded description, much as we do now, but its subject would be <http://viaf.org/viaf/86607236> rather than a data.archiveshub.ac.uk URI. The challenge then becomes one of how to make that document discoverable (through foaf:isPrimaryTopicOf/wdrs:describedby/rdfs:seeAlso links? through third-party services built on such links?)

I admit I hesitate to advocate taking this plunge at this point. The cases of the Language URIs and the Person URIs do seem to be different – although in ways I’m not sure I can articulate very clearly! Using the lexvo.org Language URIs seems appropriate in part because it doesn’t seem like we have “anything interesting to say” about languages, but the person case feels more “core” to “our data”. Also we will almost certainly always have to handle cases for which VIAF doesn’t provide a URI and we need our own Person URI. On the other hand, if, say, the National Register of Archives “authority file” data had already existed as Linked Data, and provided URIs for persons, would we still coin our own URIs for those cases? Or would we have simply adopted their URIs wherever we could? I’d hope we’d have chosen the latter. Maybe we really do need to become more relaxed about embracing the use of others’ URIs.

So… I think we need to think more about whether to take that step of using external URIs instead of our own, but I do think our URI alias issues in general need some attention, probably involving some sort of an extension to the current process to introduce a “consolidation” step between the transformation stage and the publishing stage so that where we know we are coining multiple URIs for an entity, we publish only one of them.

Posted in archival description, identifiers, linked data

GLAM Rocks! – Libraries, Media & The Semantic Web hosted by the BBC

I had the very great pleasure of speaking at the ‘Libraries, Media & The Semantic Web’ event hosted by the BBC Academy last Wednesday, along with folks from the New York Times, the BBC, Google in the guise of Schema.org, Historypin and KONA. The event was organised by the Lotico London Semantic Web Group.

The General Manager for News and Media at the BBC, Phil Fearnley, introduced the event, and immediately caught our attention by informing us that the BBC is continuing to make a substantial commitment to semantic web technologies, having devoted 20% of its entire digital budget to activities underpinned by this technology. Nice one Phil.

After a few opening words from Marco Neumann of Lotico, Jon Voss was then up, giving us a briefing on Linked Open Data in Libraries, Archives and Museums (LODLAM) efforts around the world, and upcoming plans within the community. He talked about how the first International LODLAM Summit held in San Francisco last year has galvanised the LODLAM community, and helped kick off a number of activities. Jon was the main convener of the summit, and kindly asked me to be on the organising committee, so, although you could say I’m biased, I can vouch for the fact that it was a great event. He also mentioned how the number of LODLAM events across the world has grown, with meetups in Australia, the UK and a number of places around the USA. Jon also talked about some recent work Historypin are doing to allow users to dig deeper into archival records based on time and place, to enhance the Historypin experience using linked data principles. He wrapped up by emphasising the importance of open licenses, and how open data has to come before linked open data.

I was up next, giving a whistle-stop tour of UK LODLAM activities, myself being Adrian Stevenson, Senior Technical Innovations Coordinator at Mimas, University of Manchester. Given that I was in the vicinity of where the classic glam rock bands have played, I couldn’t resist the temptation to use the galleries, libraries, archives and museums ‘GLAM’ acronym for my presentation title, and throw in a glitter platform shoe on the opening slide. I covered the work of the LOCAH and Linking Lives projects, before giving a heads up to a number of the JISC-funded Discovery projects doing linked data work, including the Bricolage project in which our own Pete Johnston is involved, and the newish World War One Discovery project I’m working on. I finished up by focussing on particular challenges we’ve met on LOCAH and Linking Lives, namely the difficulty of creating links based around names, and the general problem of finding data to link to.

We then moved to the media perspective on things, with Evan Sandhaus, lead architect for semantic platforms at the New York Times, giving us the low down on rNews, an embedded data standard for the news industry from the IPTC. Evan explained the ‘silly’ situation we’ve ended up in, where the data content of news articles is kept in structured form behind the scenes in databases, but this structure is lost when the data is presented to the Web in HTML. To address this weakness, the IPTC came up with the rNews data standard, which is defined as “a data model for embedding machine-readable publishing metadata in web documents and a set of suggested implementations”. Currently there are RDFa and HTML5 implementations, with a JSON implementation under consideration.

Addressing the benefits, Evan explained that rNews can provide superior algorithmically generated links, such as those generated by Google Rich Snippets, thereby improving referral traffic. In addition, it can allow for better analytics provided by the better quality data. It was noted, however, that these benefits will depend on the wide adoption of rNews in the community. He then gave a short history of the development of rNews, culminating in the announcement that it has now been adopted by the New York Times, and is used on all news articles published after 29th January 2012. Evan mentioned how the arrival of Schema.org, which essentially does the same thing as rNews, caused something of an “existential crisis”. Fortunately, the organisations have worked together, and schema.org has now been expanded to absorb about 98% of the rNews data model.

Dan Brickley from Google, working on Schema.org, gave a really interesting talk looking back at the history of search and structured data over the past 100 years. He used this as a way to highlight the connections between the GLAM sector, the media, and the problems schema.org is aiming to solve. Dan proposed the notion that somewhere in Belgium, semantic search over structured data went mainstream as long ago as 1912. He backed this up by quoting some search queries logged in the 1912 annual report from the Belgian Institute of Bibliography. Dan went on to talk about Lonclass, a BBC media archives classification system still used today. Dan suggested that Lonclass is based on structured semantic data, having compositional semantics predating computing. Using Lonclass, it’s possible to build sentences from its semantics, e.g. the Lonclass code ‘656.881:301.162.721’ for “Letters of apology” can be combined with the codes for ‘resignation letters’, ‘Margaret Thatcher’ etc.

Dan described how Schema.org, launched in June 2011, is essentially the result of a loose collaboration of engineering groups from Google, Bing, Yahoo & Yandex. Having been somewhat behind the scenes, they are moving increasingly to a collaboration model in the public space, the vocabulary development now being hosted by the W3C. Google Rich Snippets was cited as the best known way in which this markup is being used, and the business story is that if you use schema.org markup, your page is better described, you get more click-throughs, and people can better understand search result lists. He noted there’s also an advertising aspect, though this is not part of Dan’s work. The overarching aim is to give more accurate search results. Dan reckons schema.org counts as linked data, as the markup that describes someone, say Douglas Adams, points off to another page providing more info about Douglas Adams. Dan rounded off suggesting Schema.org is basically a dictionary of terms drawing on the everyday scenarios of search. It was interesting to note that he thinks the semantic web world is too polite in feeling the need to use other people’s terms. Schema.org is relatively ‘rude’, having about 300 terms, but he believes this makes it easier to deploy.

Silver Oliver from BBC News and Knowledge outlined how they’ve been doing ‘more of the same’, building on the semantic web work used for the World Cup and applying it to the new sports site, and the upcoming 2012 Olympics site. There’ll be representations for every athlete, medal event, venue, and so on. The underlying linked data principles are the same, i.e. tagging with HTTP URIs that are then used as hooks into the web graph. They’ll be using geonames for locations, hooked onto IOC Olympic content, which typically comes in spreadsheet form. They use Google Refine with the DERI RDF plugin to get RDF from spreadsheets, then add in other existing BBC RDF content, stitching these datasets together to create useful graphs. This approach gives the benefit of providing ‘page furniture’, for example, using information on the country Jamaica, and the IOC statistics on Jamaica’s performance in Olympics, to frame and enhance the BBC content on Jamaican athletes.

Silver mentioned that Google is their biggest data consumer, using their microdata and RDFa. He noted that the 2012 Olympics pages will have schema.org data in, and also mentioned work using hRecipe for exposing structured recipe information: these have surfaced really well on Google.

Yves Raimond from BBC R&D then talked about the challenge of surfacing the huge amount of excellent BBC archive content, and the challenge of making it connect with current content. The BBC has a massive archive, but tagging has only been used for a few years, and much of the archive has only very sparse and often incorrect metadata. He described how they’ve been using automated tagging with linked data URIs to make connections to current content to help push the archive to users. They’ve been trialling the approach on the World Service archive, which contains a massive audio database. They’re using a piece of software they’ve developed called ‘KiWi’, built with open source components, and some custom-built algorithms to automatically tag content. CMU Sphinx is used to create ‘very noisy’ speech-to-text transcripts. More will be published on how they’re using KiWi in the next few months. Yves then showed us examples of autotagged programme content. As he noted, it appears to do a decent job, but some of the tags are wrong. He mentioned the possibility of using crowdsourced tags to improve the accuracy of the content.

That was basically it for the presentations part of the proceedings. All the speakers then came up for a short Q&A session, mainly focussed on the media side of things, and after this we headed to the nearest bar.  All in all it was a great evening, and I felt quite privileged to be part of a panel of such esteemed experts.

I’ve included the speakers’ slides where I’ve been able to track them down below:

Posted in archival context, linked data, open data | 4 Comments

From EAD to Linked Data: Talk at UCL

Last Friday, at the invitation of Jenny Bunn, I visited UCL to talk to some of her postgrad students on the MA course in Archives and Record Management about Linked Data in general and the experiences of the LOCAH and Linking Lives projects in particular. I don’t think I really covered anything that we haven’t mentioned already here or on the LOCAH blog, but it gave me an opportunity to combine some general “tutorial”-ish background material with a few thoughts on some of those aspects of archival description and EAD that at times make the process of generating RDF “challenging”, and I thought I’d share the slides here (PDF).

Posted in archival description, linked data | Leave a comment

Do not underestimate cleaning your data!

In Linked Open Data: The Essentials (Bauer, Kaltenbock) the first steps given for publishing your content as LOD are:

1. Analyse your data

2. Clean your data

3. Model your data

…and it goes on to very helpfully summarise the further steps required. The steps given are typical of the advice often given about how to create Linked Data.

Under ‘Clean your data’ it states:

Data and information that comes from many distributed data sources and in several different formats (e.g. databases, XML, CSV, Geodata, etc.) require additional effort to ensure easy and efficient modelling. This includes ridding your data and information of any additional information that will not be included in your published data sets.

In retrospect, I greatly underestimated this particular step. Format is fine as far as we are concerned, but our data does come from many data sources – from over 200 sources in fact. I’m not sure about ridding the data of additional information, but for us issues around data consistency have created a very significant amount of extra work; work that I did not properly factor into the process.

Before I say any more about this, I want to make one thing clear: in talking about inconsistency and ‘errors’ in the data, I am not wanting to criticise the Archives Hub contributors at all. For a start, much of the data in the Hub was created over many years, and much has been migrated from many different systems. Secondly, we were simply not thinking in a Linked Data way 5 or 10 years ago. We didn’t necessarily prioritise consistency in instances where it now becomes much much more important. We didn’t ask for things that we now ask for, or ensure checks were made for certain data. We had other priorities, and the challenge of just creating an aggregator from scratch was pretty huge.

In Linked Data, you are bringing all (or many) of the entities within the data to the fore. In a way, it’s as if they can’t hide anymore; they can’t just sit within the context of the collection description and display themselves to users through a Web interface. They have to work a bit harder than that because essentially they all become lead players. And it feels to me as if this is what really makes the quality of the data so important.

I have recently blogged about the issue we have had with identifiers. This is probably the biggest issue we have to deal with. But others have come up. For example, some of our descriptions have ‘Reference’, as you would expect, but they also have ‘Former Reference’ (both in the same tag of ‘unitid’). The problem with this is that it is not always encoded consistently, so then it becomes hard to say ‘where X is included do Y’.

Another example is where we have two or more creators for a description. Up until now, we have simply had one field for contributors to add ‘name of creator’ (the EAD ‘origination’ tag), but that means that two or more names simply go into the same field and are not made distinct in a way that a machine can process. It’s fine for display. A human knows that Dr James Laidlaw Maxwell, Dr James Preston Maxwell means two people. But it is harder for a machine to distinguish names if there isn’t a consistent separator. In Linked Data terms it may mean that you end up with a person effectively identified as ‘drjameslaidlawmaxwell,drjamesprestonmaxwell’. (The comma may seem like a reasonable separator, but commas often exist within names, as they can be inverted, and other entries don’t use a comma.)
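To make the problem concrete, here is a minimal Python sketch. It is not our actual conversion code: the function names `name_slug` and `naive_split` are hypothetical, the slug rule (lower-case, strip whitespace) is an assumption based on the example above, and the inverted name is made up for illustration.

```python
def name_slug(origination: str) -> str:
    """Assumed slug rule: remove all whitespace and lower-case the rest."""
    return "".join(origination.split()).lower()


def naive_split(origination: str) -> list:
    """Split on commas, which looks tempting but is not safe for names."""
    return [part.strip() for part in origination.split(",")]


# Two creators in one field collapse into a single 'person' identifier.
two_people = "Dr James Laidlaw Maxwell, Dr James Preston Maxwell"
print(name_slug(two_people))

# Splitting on the comma fixes that case...
print(naive_split(two_people))

# ...but breaks an inverted name (hypothetical entry), which is one person.
print(naive_split("Maxwell, James Laidlaw"))
```

The last call returns two fragments for what is really a single person, which is why a consistent separator has to be agreed at source rather than guessed at conversion time.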

During our Linked Data work, what we have done when we find a data issue is to make a decision whether the issue will be dealt with through the conversion process or dealt with at source. In general, I think it’s worth dealing with issues at source, because it tends to mean the quality and consistency of the data (thinking particularly in terms of markup) is improved.

Furthermore, this emphasis on the data has led us to think quite fundamentally about many aspects of our data structure, the ways that we ask people to create descriptions and how we can improve our ‘EAD Editor’ in order to ensure more consistency – not just from a Linked Data perspective. It has contributed to a decision to make this kind of data editing more rigorous and better documented. It has also made us think about how to convey what is good practice more effectively, bearing in mind that many people don’t have much of a sense of what might be needed for Linked Data.

However, the other side of the coin is the realisation that you cannot clean your data perfectly. We have over 25,000 collection descriptions and many hundreds of thousands of lower-level entries. It is likely that we will have to live with a certain level of variation, because some cleaning up would be very hard to do other than manually. Our data will always come from a variety of sources, and it may be that our move towards importing data from other systems actually introduces more variation. For example, I recently found that a number of descriptions from one contributor, exported from another system, did not provide the ‘creator’ entry as a structured access point (index term). This is a disadvantage with Linked Data, where you are trying to uniquely identify individuals and match that name to other instances of the same person.

Data cleaning can sometimes feel like a can of worms, and I warn those with similar aggregated data, or data from different sources, that dealing with this can really start to eat away at your time! I would certainly advise starting off by thinking about workflow for data cleaning – the reporting, decision making, documenting, addressing, testing, signing-off – whatever you need to do. In retrospect I would have started a spreadsheet straight off. But overall, I think that it has been good for us to think more carefully about our data standards and how we can improve consistency. I feel that it’s something we should address, whether or not Linked Data is involved, because it increases the potential of the data, e.g. for creating visualisations, and it generally makes it more interoperable.

Posted in barriers, data cleaning, data processing, identifiers, linked data | 1 Comment

Unique Identifiers for Archives in a Linked Data World

Our Linked Data work has thrown up a significant number of challenges around the consistency and structure of the source data from the Archives Hub, and nowhere more so than around identifiers for the archival resources, that is, the references used for the archives at all levels of description, be it collection, series, file or item.

Identifiers on the Hub

Identifiers serve two distinct purposes on the Archives Hub:

(i) the identifier for the archive itself – the reference for the actual collection or sub-collection. This is contained within the ‘unitid’ tag.

(ii) the identifier for the description of that archive – the finding aid. This is contained within the ‘eadid’ tag.

a) Identifiers for the Description of the Archive

The eadid tag consistently contains attributes for the country and the agency that maintains the description. This information is also given within the content of the tag.  The Hub URI is created by converting the reference to lower case, and converting slashes to dashes:

<eadid countrycode="GB" mainagencycode="1234" identifier="JaB/A">GB 1234 JaB/A</eadid>

becomes

http://archiveshub.ac.uk/data/gb1234jab-a
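As a rough illustration of that rule, here is a minimal Python sketch. The function name `hub_description_uri` is hypothetical, and I am assuming that spaces are simply dropped, since that is what the example above implies; the Hub's own implementation may differ in detail.

```python
def hub_description_uri(eadid_text: str,
                        base: str = "http://archiveshub.ac.uk/data/") -> str:
    """Build a Hub-style URI from the text content of an eadid element."""
    # Assumption: all whitespace is removed before the other two steps.
    compact = "".join(eadid_text.split())
    # Documented steps: convert to lower case, convert slashes to dashes.
    return base + compact.lower().replace("/", "-")


print(hub_description_uri("GB 1234 JaB/A"))
```

Note that both steps lose information (case, and the difference between ‘/’ and ‘-’), which matters later when we look at duplicate identifiers.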

b) Identifiers for the Archive

We display the identifier for the archive within the description. So, to a degree, the way this identifier is structured in the data is less important, as long as we display it to the researcher.

The identifier for the archive is typically the same as the identifier for the description, including a code for the country and a code for the repository as well as the local identifier for the archive, although they serve different purposes:

Reference: GB 1234 JaB/A

At the top level, things can seem relatively straightforward. But bear in mind that on the Hub the primary role of the ‘unitid’ reference is to be a visual indicator of the reference. The important thing is what displays to the end-user, so a level of inconsistency in the make-up of the unitid might not be a problem as long as we display the correct reference.

If you look behind the scenes, there is a lack of consistency in the structure of these  identifiers. The country code and repository code may exist within the content (which is displayed), or they may exist as ‘attributes’ – which provide additional information that is not part of the content (and which can be displayed, but may not be), or they may exist as both. Occasionally they are not present at all.

For those that are familiar with XML markup, I mean that we could display a reference such as ‘GB 0982 UWA’, but there are various ways the data may be structured:

(1) <unitid countrycode="gb" repositorycode="0982" identifier="UWA">GB 0982 UWA</unitid>

(2) <unitid>GB 0982 UWA</unitid>

(3) <unitid countrycode="gb" repositorycode="0982">UWA</unitid>

(4) <unitid>UWA</unitid>

Even if you are not familiar with XML, you can see the way that the content is the same (apart from the last example) but the way it is structured differs. However, as long as we can display ‘GB 0982 UWA’ on the Archives Hub we are OK with this. We have ensured that our stylesheet copes with a number of different options, bearing in mind this is just for what displays through a Web browser.
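The kind of coping logic involved can be sketched in a few lines of Python. This is an illustration only, not our display stylesheet (which is XSLT); the function name `display_reference` and the fallback rules are my assumptions.

```python
import xml.etree.ElementTree as ET


def display_reference(unitid_xml: str) -> str:
    """Return 'COUNTRY REPOSITORY LOCAL' wherever the markup allows it."""
    element = ET.fromstring(unitid_xml)
    content = " ".join((element.text or "").split())
    country = element.get("countrycode", "").upper()
    repository = element.get("repositorycode", "")
    prefix = " ".join(part for part in (country, repository) if part)
    # Codes already in the content (1 and 2), or no attributes at all (4):
    # show the content unchanged.
    if not prefix or content.upper().startswith(prefix):
        return content
    # Codes only present as attributes (3): prepend them for display.
    return f"{prefix} {content}"


variants = [
    '<unitid countrycode="gb" repositorycode="0982" identifier="UWA">GB 0982 UWA</unitid>',
    '<unitid>GB 0982 UWA</unitid>',
    '<unitid countrycode="gb" repositorycode="0982">UWA</unitid>',
    '<unitid>UWA</unitid>',
]
for variant in variants:
    print(display_reference(variant))
```

The first three variants all come out as the same display reference, while the fourth can only ever give the local part, because the country and repository codes are simply not there to be found.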

c) Identifiers for the Archive at Lower Levels

On the Hub,  lower levels are assigned persistent identifiers in a similar way to collections. A component’s identifier is that of its parent record (i.e. the content of the eadid tag), followed by a hyphen, then the unitid of the component.

There is no lower-level eadid – the only identifier at lower-levels is the unitid. So, we use the collection-level identifier for the description along with the lower-level identifier for the archive as the unique identifier for the lower level description (eadid + lower-level unitid):

<eadid countrycode="GB" mainagencycode="1234" identifier="JaB/A">GB 1234 JaB/A</eadid>

and lower-level

<unitid>GB 1234 JaB/A/3/1</unitid>

Would have the URI of:

http://archiveshub.ac.uk/data/gb1234jab-a-gb1234jab-a-3-1
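A minimal Python sketch of this lower-level rule follows. The names `slug` and `component_uri` are hypothetical, and the whitespace handling is an assumption inferred from the examples rather than taken from the Hub code.

```python
def slug(reference: str) -> str:
    """Assumed normalisation: drop whitespace, lower-case, '/' becomes '-'."""
    return "".join(reference.split()).lower().replace("/", "-")


def component_uri(eadid_text: str, unitid_text: str,
                  base: str = "http://archiveshub.ac.uk/data/") -> str:
    """Parent eadid, then a hyphen, then the component's own unitid."""
    return f"{base}{slug(eadid_text)}-{slug(unitid_text)}"


print(component_uri("GB 1234 JaB/A", "GB 1234 JaB/A/3/1"))
```

Because the unitid here repeats the country and repository codes, they appear twice in the resulting URI; a unitid holding only the local reference would give a shorter result.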

Maintaining the Distinction Between Identifiers

The Hub has, in a sense, tended to conflate the identifier for the archive with the identifier for the finding aid that describes the archive. Not in the sense of what their function is within the Hub, but more in the way that we took the decision to recommend that these two identifiers should be the same, which makes good sense most of the time. This means that it can be harder to convey the different purposes of the two identifiers. For most of the time this is not a problem, but I think many archivists do think that they are creating one identifier, and don’t think about the distinction between identifying an archive and the description of an archive.

Creating Linked Data Identifiers

Within our Linked Data the identifiers for the descriptions are the Archives Hub URIs, so that we link back into the Hub from the Linked Data. The challenges are around the URIs for the archive collection or sub-collection.

Initially we used the eadid content in the URI for the archive collection identifier, so for example:

eadid reference = GB 1234 JaB
URI = http://data.archiveshub.ac.uk/id/archivalresource/gb1234jab.

The ‘JaB’ may also be the identifier for the archive, but in this case it is being used as a unique identifier for the description.

The reason that we used the eadid was because it is definitely unique. The Hub requires all eadids to be unique. However, there are two main issues with this:

1) The eadid is the identifier for the description, not the archive collection.

2) Sometimes the agency that maintains the description is not the same as the agency that holds the archive. This is reflected in a different code used in the eadid (to reflect the agency that created the description) and the unitid (to show the repository that holds the archive).

For example:
eadid = GB 133 PPL – an archive maintained by ‘133’, which is the John Rylands Library
unitid = GB 135 PPL – an archive stored at ‘135’, The Methodist Archive (part of the Library, but a separate entity)

In this case, we have to maintain the difference between the eadid and unitid because they are telling us different things.

It was for these reasons that we felt we should create the URI for the archive from the identifier for the archive, which is the content of the ‘unitid’ tag.

URIs for the Archive at Collection Level

At the top level, things can seem relatively straightforward. Examples of unitid:

GB 1086 Skinner
GB 0532 cwlmga

These are neat examples, and we can translate these into nice Linked Data URIs for the archival resource:

http://data.archiveshub.ac.uk/doc/archivalresource/gb1086skinner

http://data.archiveshub.ac.uk/doc/archivalresource/gb0532cwlmga

URIs for the Archive at Lower Levels

For lower-levels the unitid entries can be quite complicated, although they should work fine if the country code and repository code are included in some way:

GB 2008 TAS1/1/1
could be:

http://data.archiveshub.ac.uk/doc/archivalresource/gb2008tas1-1-1

But on the Hub, as I have said, we combine the eadid for the top level with this lower level  unitid, for reasons to do with trying to ensure the reference is unique and ensuring that the country code and repository code are incorporated, so

eadid: GB 2008 TaS
unitid: GB 2008 TAS1/1/1
would be:

http://data.archiveshub.ac.uk/doc/archivalresource/gb2008tas-gb2008tas1-1-1

or
eadid: GB 2008 TaS
unitid: TAS1/1/1
would be:

http://data.archiveshub.ac.uk/doc/archivalresource/gb2008tas-tas1-1-1

Problems with Using the Unitid for the Identifier

1) Attributes for the Country and Repository

The fact that the unitid doesn’t always contain the country code and repository code, or they may be present as attributes or as content, or both, is problematic for Linked Data, where the identifier for the archive needs a unique URI and the attributes help to provide this.

2)  Maintaining the Distinction between Identifiers

As set out above, the Hub has recommended that contributors use the same identifier for the finding aid as the identifier that describes the archive (unless the agency is different). The different function of these two identifiers is still preserved within the Hub, but using the same data for them works perfectly well.

However, we should not actually assume that the eadid’s mainagency code and the unitid’s repository code are the same. This is because the code for the eadid agency refers to the agency responsible for the description, and the code for the unitid refers to the repository responsible for the archive. They are usually the same, but they are not necessarily the same. If we want to make statements about our content, such as ‘this archive is held at this repository’ and ‘this person was responsible for creating this description’ then the distinction becomes important as they may be different agencies.

The agencies can also change over time. Take the example of the Papers of Dr Thomas Coke. These papers could be held at the Methodist Archive (code 135) and described by a minimal EAD document maintained by the Methodist Archive, so eadid/mainagencycode = unitid/repositorycode = 135.

But then at some point maybe John Rylands University Archive revises the description and extends it (maybe the first one was just collection-level and now it is multi-level). So now eadid/mainagencycode = 133 and unitid/repositorycode = 135.

The archival stuff hasn’t changed, and it is still held/curated by Methodist Archive, so the URIs of the archival stuff shouldn’t change, even where the description is attributed to a different repository. This means that we shouldn’t rely on eadid content/attributes in constructing URIs for the holding repository.

This whole situation can be complicated by the fact that sometimes the unitid does not contain the repository code at all, so the only code we have is from the eadid, and we have to assume they are the same agency.
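The decision described here can be written down as a small Python sketch. The function name `holding_repository_code` and the idea of returning a note about where the code came from are my own assumptions for illustration, not part of our transform.

```python
from typing import Optional, Tuple


def holding_repository_code(unitid_repository: Optional[str],
                            eadid_agency: Optional[str]) -> Tuple[Optional[str], str]:
    """Pick the code for the repository holding the archive, with its source."""
    # Preferred: the unitid's own repository code, which describes the holder.
    if unitid_repository:
        return unitid_repository, "unitid"
    # Fallback: the eadid agency, which really describes who maintains the
    # description; using it means assuming the two agencies are the same.
    if eadid_agency:
        return eadid_agency, "eadid (assumed same agency)"
    return None, "missing"


# The Thomas Coke scenario: description revised by 133, archive still at 135.
print(holding_repository_code("135", "133"))
# A unitid with no repository code: we can only fall back on the eadid.
print(holding_repository_code(None, "133"))
```

Recording which route was taken makes it possible to list the descriptions where the assumption was made and check them with contributors later.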

3) Inconsistencies in the Data

As stated above, the unitid does not always contain attributes, and so entries vary quite widely. This is mainly as a result of data coming into the Hub from many different sources over a period of time. Many descriptions were created in other systems, and it is always a challenge to move data between systems and end up with something consistent and fit for purpose. Many descriptions were created in something like Word originally, and so issues such as unique identifiers for URIs were not in the game plan at the time. In general, the eadid entries are more consistent and easier to work with than the unitid entries for multi-level descriptions.

4) Unitid may not be Unique

We have hit problems with the unitid not being unique throughout the Hub, mainly for lower-level descriptions, and this is the most significant problem. For the Hub, the only identifier that has to be unique is the identifier for the description, the eadid, because this is what the Hub works with – the Hub essentially works with the description.

Process to Identify Duplicate Unitids

Pete created an analysis of unitid content and attributes using XSLT, a nifty piece of work that allowed me to see exactly where the duplicate identifiers are. We found that in general duplicates apply to both the raw EAD and the identifiers created by the Archives Hub. But the transformation process, whereby the Hub converts to lower case and uses ‘-’ instead of ‘/’, can create a duplicate where it did not exist in the raw EAD, such as for StoneR and Stoner (both become ‘stoner’), or for MS/1 and MS-1 (both become ‘ms-1’).
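Pete's analysis was done in XSLT; the following is a separate, minimal Python sketch of the same idea, showing only the clashes that the transformation itself introduces. The names `slug` and `find_clashes` are hypothetical, and ‘Skinner’ is included just as a reference that does not clash.

```python
from collections import defaultdict


def slug(reference: str) -> str:
    """Assumed Hub-style normalisation: no spaces, lower case, '/' to '-'."""
    return "".join(reference.split()).lower().replace("/", "-")


def find_clashes(references):
    """Group distinct raw references that share one normalised form."""
    groups = defaultdict(set)
    for reference in references:
        groups[slug(reference)].add(reference)
    # Keep only normalised forms reached from two or more distinct references.
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


references = ["StoneR", "Stoner", "MS/1", "MS-1", "Skinner"]
for normalised, raw in find_clashes(references).items():
    print(normalised, "<-", raw)
```

Using a set per group means that a reference repeated verbatim in the raw EAD is not reported here; those duplicates need a simple count on the raw values instead.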

With the results of this analysis in a spreadsheet, I was able to order the entries by ‘clashes’ within the identifiers and then decide how to tackle them.

Case Distinction

The ‘Stoner/StoneR’ problem is tricky. One way to get round it would be to go back to URIs that are case sensitive, but this does seem like a retrograde step. Another option would be to work with the contributors to avoid using case as a means to distinguish references, and I think this is what we will do. I think they will be happy to work with us on this, so that we can stick to lower-case URIs but avoid duplication.

Distinguishing Repository Code and Reference

There is an issue with identifiers where the local reference starts with a number, e.g. http://archiveshub.ac.uk/data/gb1067esw. It looks like the repository code is ‘1067’ but actually it is ‘106’ and the reference is ‘7ESW’. This could potentially be a problem if the repository codes of ‘106’ and ‘1067’ were both used on the Hub.

We are starting to think about converting to URIs with hyphens to show the three parts of the reference, e.g. http://archiveshub.ac.uk/data/gb-106-7esw.
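A short Python sketch shows why the hyphens help. The function names are hypothetical, and the second reference (repository ‘1067’ with local reference ‘ESW’) is an invented case used only to demonstrate the potential clash.

```python
def local_slug(local: str) -> str:
    """Normalise the local reference: no spaces, lower case, '/' to '-'."""
    return "".join(local.split()).lower().replace("/", "-")


def compact_slug(country: str, repository: str, local: str) -> str:
    """Current style: the three parts run together with no separators."""
    return f"{country.lower()}{repository}{local_slug(local)}"


def hyphenated_slug(country: str, repository: str, local: str) -> str:
    """Proposed style: the first two hyphens mark the part boundaries."""
    return f"{country.lower()}-{repository}-{local_slug(local)}"


# Real case (106 + 7ESW) against an invented one (1067 + ESW).
print(compact_slug("GB", "106", "7ESW"), compact_slug("GB", "1067", "ESW"))
print(hyphenated_slug("GB", "106", "7ESW"), hyphenated_slug("GB", "1067", "ESW"))
```

The compact forms are identical, while the hyphenated forms stay distinct and can be split back into country, repository and local reference.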

Former Reference

We had to deal with filtering out ‘former reference’ which many contributors include as a second unitid entry, particularly where the descriptions come from other systems.  This is usually a relatively straightforward process, although slightly complicated by the fact that in a few instances the attribute value ‘former reference’ isn’t used. E.g:

<unitid label="former reference">http://archiveshub.ac.uk/data/gb123abc</unitid>

<unitid>Former reference: http://archiveshub.ac.uk/data/gb123abc</unitid>

The latter example can create problems for us. However, as with so many things, there is one further issue here: some contributors actually want the former reference (what they might call ‘alternative reference’) to be the current reference – in this case we are stumped! The only option would be to edit the Hub descriptions prior to creating the Linked Data.
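The filtering itself can be sketched in Python along these lines. This is an illustration under assumptions, not our transform: the function name `current_unitids`, the `did` wrapper and the sample references are all made up, and the text-prefix check is only a heuristic for the case where the attribute is missing.

```python
import xml.etree.ElementTree as ET


def current_unitids(did_xml: str) -> list:
    """Return unitid values that are not marked as former references."""
    did = ET.fromstring(did_xml)
    current = []
    for unitid in did.findall("unitid"):
        label = (unitid.get("label") or "").strip().lower()
        text = " ".join((unitid.text or "").split())
        # Straightforward case: the label attribute says so.
        if label == "former reference":
            continue
        # Awkward case: no attribute, only a prefix inside the content.
        if text.lower().startswith("former reference"):
            continue
        current.append(text)
    return current


sample = """<did>
  <unitid>GB 123 ABC</unitid>
  <unitid label="former reference">GB 123 XYZ</unitid>
  <unitid>Former reference: GB 123 OLD</unitid>
</did>"""
print(current_unitids(sample))
```

Neither check helps where a contributor wants the ‘former’ value to be treated as the current reference; that still needs an edit to the source description.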

Changes to References

In the end, even once we face all of the other issues with identifiers, we also have the risk that an archive repository will choose to change its reference. This is not common, but it does happen. We could probably find a way to ‘archive’ the old reference and have some kind of link between the two. The horror scenario would be if the repository then used the old reference, which to them may now be redundant, for a new accession.

Reasons for Duplication

The unitid is usually duplicated due to human error, rather than as part of the system of cataloguing.  It isn’t surprising when you consider the level of detail and length of some descriptions, and that they are created with various software applications, some of which don’t really help with creating good identifiers. You find entries with references such as:  JAB/1/2(i)/2/2, JAB/1/2(i)/2/3, JAB/1/2(i)/3, JAB/1/2(i)/5. Once you identify where the mistake is, you can simply correct it.

In one interesting example for a large collection, the duplicate identifier was purposely used, because the higher level entry described a person and the entry below that described the ‘stuff’ about them. So, something like ABC/JB for an entry about John Bunn in the ABC collection, and ABC/JB for a further entry for a file of stuff about John Bunn. You can see how in the display this works OK, although it means each entry has the same URI, but in terms of Linked Data it is a problem.

Conclusion

Our plan at present is, as far as possible, to correct the data, rather than try to work around it. This means some ‘persistent’ URIs will change, but it seems worth the end result of ensuring unique identifiers. We will have to run the analysis on all of our data in order to pick up any issues that need addressing. This has the advantage that we can also pick up other problems with the identifiers, such as the odd incorrect repository code, or attributes absent from eadid entries.

It has been crucial for Pete, who has done the processing, to really understand how URIs are constructed, and in doing this it has become apparent how our system has been skilfully developed by John Harrison, our developer, to cope with many variations, as you have to with a large aggregator.  It would have been extremely difficult to ensure a level of rigour in the data initially, when we were taking in such a diversity of content and focussing on building up the service. In addition, we could not plan for ‘cool URIs’ or, of course, Linked Data.

For Linked Data, rigour is important because you are drawing out so many different entities within the world you are describing, and they all need unique and dereferenceable http URIs. The question is, do you put the effort into introducing more rigour into your source data – is it worth the investment? It is certainly often time-consuming and can be quite a difficult process with so many variables to think about. I think that it generally is worth doing, because more consistent data with properly constructed identifiers must be a good thing, not just for Linked Data but for the whole open and interoperable agenda.

Posted in barriers, data processing, identifiers | 2 Comments

Archives & Linked Data Meetup

With the various Linked Data projects that are happening in archives, I thought the time seemed right to get together and share what we have done, what we think and what we see as the challenges around this work. The Locah project team were particularly keen to find out what the AIM25 team had been up to with their Open Metadata Pathway and Step Change projects, but it seemed well worth broadening this out, so I put out an invitation to other projects and through the archives email list. In the end we had several projects represented:

Locah / Linking Lives: a project to create Linked Data for EAD archive descriptions on the Archives Hub. We have provided a stylesheet that converts EAD to RDF/XML as well as a SPARQL endpoint and Linked Data views of our data. We did a significant amount of work around data modelling, the use of RDF vocabularies and creating external links, and we blogged in detail about the processes and issues involved in creating Linked Data from complex hierarchical archive descriptions. We are now working on an interface to show how the Linked Data can be used to bring different resources together.

Open Metadata Pathway / Step Change: work around the use of OpenCalais and the UKAT thesaurus (subjects and names) to extract entities from data and enable URIs to be created. A tool is being developed to allow archivists to do this at the time of cataloguing and the project is working with Axiell CALM to embed this into the CALM software and display via CALMView.

SALDA: A project to output the Mass Observation archive as Linked Data and enhance the data. This built on the Locah work.

Bricolage: will publish catalogue metadata as Linked Open Data for two of its most significant collections: the Penguin Archive, a comprehensive collection of the publisher’s papers and books; and the Geology Museum. It intends to build on previous work done by Locah and SALDA.

Trenches to Triples: will provide Linked Data markup to both collection level descriptions and item level catalogue entries relating to the First World War from the Liddell Hart Centre for Military Archives. It will also provide a demonstrator for using Linked Data to make appropriate connections between image databases, Serving Soldier, and detailed catalogues.

The majority of those attending represented these projects. There was also representation from the DEEP project to digitise English place names and make them available as structured data. Other attendees represented the museum sector and The National Archives. In addition, we brought developers together with archivists and managers, and I think we managed to strike a good balance in our discussions so that they were of benefit to everyone.

In the morning we shared information about our projects. This gave us a chance to ask questions and get a clearer understanding of what people have been doing in this space. A number of issues were presented and discussed.

Extracting concepts from data

We were given a demonstration of the prototype cataloguing interface developed by the OMP project and now being developed under Step Change. It uses OpenCalais to extract concepts from archive descriptions, which tend to be quite document-centric, and contain large chunks of text, particularly in the biographical history and the scope and content sections. The idea is to provide URIs for these concepts, so, for example, OpenCalais highlights a name within a paragraph of text, such as ‘Architectural Association’, and you can then confirm that this is an organisation and it is relevant as an index term so that it is marked up appropriately.  The tool is being tested with archivists because ease of use is going to be key to its success. We discussed the limitations of the OpenCalais vocabulary – it does really clever data analysis, but it isn’t geared up for historical data sources. UKAT is much broader and more suitable for archive descriptions – it would be good to integrate this vocabulary into OpenCalais.

Solving the challenges of archive descriptions

We discussed some of the challenges that Locah has faced with processing multi-level archive descriptions, challenges such as: duplicate identifiers for different resources (especially within the same description – more to come on this issue on the Linking Lives blog); creating URIs for data such as extent of archive (where you may have ’10 boxes’, but you may also have ’2 boxes, 10 large photographs and a reel of film’); inheritance of data from the top level down through the levels of a description (which is problematic within Linked Data) and matching names on the Archives Hub to names on VIAF (which we’ve had reasonable success in doing, though in archives names can be quite problematic, such as ‘John Davis, fl. 1880-1890′).

Working with one dataset versus working with a large aggregation

We thought about the comparison between creating a Linked Data output for the Archives Hub, which aggregates data from hundreds of archives, and creating it for just the Mass Observation Archive. Whilst the scale you are working with is appealing with a large aggregator (potential to create Linked Data for all these repositories), working with a discrete collection gives you more control to be able to interpret things within the data (for example, the date may always appear in a certain place, so you can confidently mark up the date as a date and give it a URI).

This led us into some discussion around the way that creating Linked Data can really highlight problems within the data source, and it may provide impetus to address these problems, thus improving the source data by making it more consistent or clarifying the meaning of elements within the data.

Integrating Linked Data into the Workflow

The Step Change project is particularly focussed on the challenge of making this kind of semantic markup easy to achieve. It has to be well-received by cataloguers. Work is being undertaken at Cumbria Record Office to test the tool out and provide feedback. We discussed the importance of major players such as Axiell CALM embracing this kind of approach, enabling Linked Data to be created from CALM descriptions. This is not yet happening, but the Step Change project is working with CALM and so it is a good starting point. We also discussed the need for the CALM user group to think about whether they want this kind of Linked Data output from their software provider (it needs to be demand-led).

The ‘Same As’ Issue

We touched on the issues around trusted data a number of times. The SALDA project found that creating ‘same-as’ links was probably the most challenging part of the project. We agreed that we must be aware of the importance of archive descriptions being trusted sources and there has been a tendency for some data providers to use ‘same-as’ links too promiscuously. In a Linked Data context this is problematic, as you are then asserting that all of the statements made by the data source you are linking to through this relationship are true.  It raises the issue of manual matching as a means to be sure your links are semantically correct, but doing this is time-consuming, so it can only be carried out in a minority of cases.

* * *

In the afternoon we had two sessions, (i) techniques and tools, and (ii) opportunities and barriers. A brief summary of some of the points that were made during the discussions:

Benefits of Linked Data

  • The principle of generic APIs – a standard way to provide access to data – it could replace the myriad of bespoke APIs now available.
  • Dataset integration – bringing data sources together.
  • The precision of information retrieval and giving researchers more potential to ask specific questions of data sources. For example, a researcher may be able to retrieve information around a specific event.
  • Embedding the expertise of practitioners was seen as something that should be a benefit, i.e. we should ensure that this happens.
  • It encourages cross-domain working.
  • It enables people to create their own interfaces and tools to utilise data sources.
  • It encourages the creation of narratives through diverse data sources.
  • It is very much an ‘anti-silo’ approach to data.

Challenges

  • Expertise required.
  • Need to clearly show the extra value it brings (e.g. above what Google offers).
  • Need clearer understanding of end-user benefits.
  • Sustainability and persistence (we talked about the idea of a ‘URI checker’ and using caching to help with this).
  • Possible overload caused by large-scale searches or ‘bad’ SPARQL queries.
  • Licensing, including restrictions on external data that you might want to link to within your data.
  • Choice of so many vocabularies.
  • Likelihood of not following the same kinds of practice, thus impeding the linking possibilities between datasets.
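The ‘URI checker’ with caching mentioned in the list above was only an idea raised in discussion, so what follows is a sketch of one way it might look and not a description of anything that exists. It is written in Python with the standard library only. The class name, the one-day cache lifetime and the rule that any status below 400 counts as ‘alive’ are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request


def http_status(uri, timeout=10):
    """Return the HTTP status code for a HEAD request, or None if unreachable."""
    request = urllib.request.Request(uri, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # no answer at all (DNS failure, timeout, refused, ...)


class UriChecker:
    """Hypothetical checker that remembers recent answers to avoid repeat requests."""

    def __init__(self, fetch=http_status, max_age_seconds=86400, clock=time.time):
        self.fetch = fetch
        self.max_age = max_age_seconds
        self.clock = clock
        self._cache = {}  # uri -> (time checked, status)

    def status(self, uri):
        now = self.clock()
        cached = self._cache.get(uri)
        if cached is not None and now - cached[0] < self.max_age:
            return cached[1]  # still fresh: reuse the earlier answer
        result = self.fetch(uri)
        self._cache[uri] = (now, result)
        return result

    def is_alive(self, uri):
        status = self.status(uri)
        return status is not None and status < 400
```

The fetch function and the clock are passed in so that the checker can be tried out without touching the network. The cache here lives only in memory; a real service would presumably want to persist it and to decide what to do with links that have been dead for some time.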

Conclusions

The group felt that a recommended approach for archive descriptions would be really useful, to facilitate others outputting Linked Data and ensure we do get the benefits that Linked Data potentially offers.

We talked about a generic stylesheet. The community has already benefitted from the data model and stylesheet developed by the Locah project: AIM25’s OMP project and SALDA have both used it, and Bricolage is looking at it for their Linked Data work. But there are still issues around the diverse nature of the data. A stylesheet to transform EAD descriptions to RDF/XML may be a great start for many projects, but modifications are almost inevitable, and making them requires expertise.

We did decide that a possible way forwards would be what one attendee called a ‘lego approach’, where we think in terms of building blocks towards Linked Data. The idea would be to work on discrete parts of a data model and to recommend best practice. For example, one area would be the relationship between the resource and the index terms or access points. Another would be the relationship between the resource and the holding institution.

This approach should be cross-domain, bringing archives together with museums and libraries.  We could look at parts of the model in turn and decide what the relationships are, whether they are consistent across the domains and which vocabularies would be appropriate. The idea would be to end up with a number of ‘RDF fragments’ that people could use, but with the flexibility to extend the model to meet their requirements.

We are hoping to discuss this proposal more and think about what would be required in order to achieve this kind of co-ordinated approach. Obviously we would immediately need to get buy-in across the three domains. Our meeting represented archives, but this approach would require a very collaborative effort. However, the advantages are very clear, and it does seem to achieve a balance between the challenges of a completely interoperable solution and the disadvantages of each domain working out different models and using different vocabularies.


Designing an Interface: some first thoughts

One of the aims of the Linking Lives project is to demonstrate the value of Linked Data through the creation of an end-user interface that pulls in content from the Hub Linked Data, including the external data sets we are linking to. The Linking Lives interface will be part of the Archives Hub service, that is to say, available from within the Hub website. We will present it as a beta service; something that is usable and useful, but also in a state of development. With the provision of this interface, we can start to build up an understanding of how valuable this type of name-based resource is for researchers. We will be able to monitor use as well as carrying out an evaluation to ask researchers what they think of the site. This is far preferable to positing benefits based upon potential, which is tending to happen too much with Linked Data at present.

This post is written from a non-technical perspective and covers a few of the areas that we are currently thinking about, as we start to set out our interface design.

Priorities

We will be concentrating on development of the interface, rather than prioritising scale for this project: quality rather than quantity you might say, although we expect to have some thousands of records included. This is partly pragmatic, because we are still finding challenges over integrating EAD data (Archives Hub descriptions) into our Linked Data because of inconsistencies and sometimes problematic content. The problems that we face with variable data are ongoing, and maybe highlight a basic issue with Linked Data: it works best with consistent data-centric information, and not so well with archival descriptions, built up over decades, many created before there were any standards at all to adhere to. However, on the positive side, our Linked Data work has enabled us to highlight and deal with many data issues, which is beneficial in the long run for any data processing that we might do (or that others might do).

Our focus for this project is on the Linking Lives pages themselves, and what researchers can access from there, so we will not be prioritising the creation of different search options into the data: this would be a next stage, once we get a clearer idea of the use of the interface.

Archives Hub Branding and Navigation

We want Linking Lives (LL) to be recognisably part of the Hub, although it would be premature to try to fully integrate the two. As yet, we don’t know how users  will respond to what we are proposing, and we need to evaluate what we are doing before taking it further into service. We are carrying out an evaluation as part of the project: we will be asking a small group of researchers questions about the current Hub interface, and following this up with some focus group work to get reactions to our new LL interface. This will help us in understanding user requirements.

Linking Lives will be an interface available within the Archives Hub site, but we propose to incorporate data other than archival descriptions within the page. This does raise questions about the clarity of what we are doing and the balance between the different data sources. If we strongly brand the page as Archives Hub, will researchers expect to access just archives, and not other information resources? Will they assume all of the sources are held by us, or that we are responsible for them? If we include the basic Hub navigation at the top of the page, will that actually confuse users, as they may click on links that take them into the main Hub search without realising that LL and the Hub are somewhat different?

We are looking at creating a sub-brand of the Hub as a possible way to identify LL as part of the Hub, but still distinct from it to some extent. This may help to distinguish between the two different applications. We will use the basic Hub logo, but modify it to signify something different. We do want to keep the links between the two, as we believe that researchers will benefit from this, and we do want to bring archives and other data sources together to provide a fuller context, and not make them too distinctly separate. The idea is to enable researchers to move seamlessly from archives described within the Hub to other resources, and take a fairly bold approach to integration, otherwise we will not get the benefits we are after. I am somewhat reminded of The National Archives’ initiative called ‘Your Archives’, a wiki for community content that does seem to have remained rather separate from the main TNA catalogues, and maybe that has been to its detriment in terms of profile and use (I often have trouble finding links to Your Archives from within TNA’s website).

Broad Appeal

The LL interface, like the Hub itself, will not be aimed at subject specialists or expert users. It will primarily be aimed at academic researchers, but is intended to appeal to a broad audience: anyone who might be interested in undertaking research. This means that we need to avoid making assumptions about knowledge. Our ‘designated community’ may not have prior knowledge of archives and certainly won’t have knowledge of Linked Data. So they may not know how archives are organised, what an archival ‘biographical history’ is, what an archival creator is, or what ‘same as’ links are between different data sources.

Our aim, therefore, is to incorporate these things in a way that makes sense and makes the person the primary focus of the page, so that it is easy to see that a page is about George Bernard Shaw, for example, and it provides life dates, descriptive information, biographical information, an image or two, aliases for the same person, etc. It is information you might expect to find, or information that makes sense within the context of a page about a person.  At the same time we are keen to ensure that we capture provenance, and so this adds another dimension. Starting to include the source of each piece of information could clutter the screen and so we will need to think about how best to incorporate it. We believe that it will be important to some users, as it could have implications for the quality and accuracy of the data. It is something we would be pleased to see others do for our data, if they were presenting it in a Web interface.

The BBC Example

Our interface will combine content from different sources. We would like to draw in content, in a similar way to the BBC (on the BBC page for Stevie Wonder you can see how the Wikipedia biography is pulled into the page). The BBC page pulls in some of the Wikipedia biog, and provides a link to go to Wikipedia and read more. This helps to make clear that the information comes from elsewhere. With MusicBrainz, another Linked Data source, the BBC provide a link to the MusicBrainz site, but also, further down their page, they state: “Links & information come from MusicBrainz. You can add or edit information about Stevie Wonder at musicbrainz.org.” The information includes personal and business relationships, such as ‘child of’ and ‘collaborated on’.

On the BBC page, the Wikipedia information is more clearly labelled as being from that source; the MusicBrainz information is also identified, but in a less obvious way. With MusicBrainz, though, they are not only declaring where the information comes from, they also invite people to edit the information themselves.

LL will be a useful resource in itself, but can also be a starting point, in much the same way as the BBC provides a page that gives substantial information on a musician or an animal they are interested in, but also invites people to move away from the site to other resources. This in itself is an interesting shift of focus. Long gone are the days when some sites actually disabled the ‘back’ button, and now we are moving towards an even more fluid world, if this type of approach continues to gain traction, where we are not always trying to keep people on our pages, but are actually encouraging them to move around the ‘Web of Data’.

Focus on Expectations

Looking at the BBC page on Stevie Wonder again, one thing that I notice is that it is quite busy. There is a good deal of information, with various boxes and loads of links and options for the user. There does seem to be a trend towards busier pages now, maybe an indication that people are increasingly adept at finding their way through information online, so a certain level of complexity is acceptable. Also, the page is quite long. The BBC page about mammals  is similarly long and complex: introduction, links to other pages on mammals, distribution, classification, BBC news, video, information elsewhere, size ranges, the Wikipedia ‘about’ page, etc. Yet the page does not seem cluttered or difficult to navigate. This is partly because of use of plain language, as well as BBC expertise in web design. It may also be that expectations largely match reality: users may expect the BBC to provide a wealth of information, and they generally know what they will get if they go to ‘programmes’ or ‘video’ or ‘news’ pages.

Expectations do play an important part in good Web design, and maybe it is easier if you are a very well known provider, as the expectations people have are clearer? Many people come into a page through a search engine, so you cannot expect they will have used your homepage, and picked up information via this route. However they arrive at a BBC page, most people know what the BBC is. But arriving at an Archives Hub Linking Lives page, you probably have little idea of the provider in this case, and you may not be clear about what archives are in this context.

We chose to create a biographical resource partly because this would provide a focus; we can convey the fact that the page is about one person relatively easily. This makes it easier in some ways than working with the Archives Hub itself, which doesn’t have that kind of focus. If we provide a page with a whole range of links to various types of biographical content, then we should be able to convey what the page is about fairly easily. It may be that good, clear and simple headings and relevant content (about one subject – in this case one person) are better than providing explanations about what you are and what you are trying to provide, as people don’t tend to read help pages.

A ‘Controlled’ Experience

Our interface will use the external data sources within our data, and will be designed in order to give users a controlled experience, in the sense that we are  evaluating the sources we include and presenting the interface in a very defined way. Of course, we cannot control the content of the external data; I am just talking about the way we present it.

An alternative approach would be to pull in all the data that can be found on a topic and display it. Maybe this is the ideal for Linked Data – the ability to bring in any data sources on a topic – but we are quite some way, it seems, from presenting this in a way that end users will want to use. Try a search on Hakia, a semantic search engine (not directly about consuming Linked Data, but about pulling in related information in a more semantic way). I looked for Beatrice Webb, and got a substantial amount of information from a very diverse range of sources, including news, blogs, Twitter, images and video. It’s quite impressive in principle, and could be really useful for a researcher, but the net is cast very wide, so it’s not easy to process all of this varied information. Sig.ma describes itself as a semantic information mash-up. If you take a look at the page that sig.ma provides for Beatrice Webb, a substantial amount of data is pulled in, but it is not very user-friendly, not always very coherent and sometimes not relevant. Obviously it is just a demonstrator, and I would say it is for a different audience, with more expertise in Linked Data. It does show the potential for this type of approach, which draws in a really diverse range of data on-the-fly, but it also shows how semantic searching is complex and difficult to achieve within a user-friendly interface.

The Linking Lives Unique Selling Point

Sites like Wikipedia have biographical pages, and we can never compete with them, so what can we offer that is of value? Essentially, our focus is on meeting the needs of those who want to carry out more in-depth research and who are likely to use primary sources. It may not be people who know they want to use primary sources, it may actually be a means to bring people to archives for the first time (we know that a large proportion of Archives Hub users are first time users of the Hub, and have not necessarily used archives before). We want to make primary sources the focus, but at the same time put them within the context of a whole range of information sources about a person, so that they are not held apart as somehow different and not for mainstream researchers.

It is also worth pointing out that our interface will still in some sense be a demonstrator – it will provide one option for presenting our Linked Data, but the data is there for others to create their own interfaces, and the SPARQL endpoint is there for people to query the data in the ways they want to. In addition, we can re-expose the data that we present. So, there are several purposes here: benefiting end-users, evaluating a name-based approach and putting archives within a broader context, demonstrating the sort of interface that can be provided from Linked Data and possibly re-exposing the data to create more potential benefit.


One Person in Context – Working with Biographical Histories

I have been starting to think about the user interface for Linking Lives. We will probably go for something quite simple in terms of layout, because there is quite a bit of complexity when bringing together a range of data sources.

It may be thought that integrating the external data sources is the challenge, but I think that it is probably more of a challenge to integrate several archival descriptions into one biographical record and also to convey the context of the archival descriptions clearly.

In this post, I am focusing on that often very useful field of information, the biographical history.  This is a field that is used to help place the archives in their context, by providing significant and relevant information about their creator(s). It is widely used in archives, although there are increasingly moves to exclude this information from the actual collection description and provide it separately. There are a few observations worth making about this field:

  • In general, it is considered good practice for the biographical history to be appropriate to the records being described. So, you don’t include a full life story when you are describing one letter relating largely to one event in a person’s life….
  • …but this guidance is not always adhered to, so some biogs are long and detailed for a small and discrete collection, others are very brief, even though they may relate to an archive that spans the individual’s entire life.
  • Some repositories will use largely, or entirely, the same biog for different collections about one individual, others will create very distinct biogs, and some may use biogs that have  been created by other institutions.
  • Some biogs will involve a significant amount of research, with the archivist drawing on the unique sources they are cataloguing to provide information that may then be quite unique in itself, making this field particularly useful for researchers.

I am going to use the example of Martha Beatrice Webb here, a significant figure in history, and one with plenty of archival sources that relate to her.

Photo of Martha Beatrice Webb

From the LSE collection on Flickr

On the Hub we have 14 collections where Beatrice Webb is the ‘creator’ or co-creator of the archive (for information on archival creators see a post on the Hub blog, Who is the creator?).  These collections are from three different archive repositories. Here is a selection of the biographical histories (not all yet available from our Linked Data store):

Beatrice Webb (1858-1943), nee Potter, social reformer and diarist. Married to Sidney Webb, pioneers of social science. She was involved in many spheres of political and social activity including the Labour Party, Fabianism, social observation, investigations into poverty, development of socialism, the foundation of the National Health Service and post war welfare state, the London School of Economics, and the New Statesman.
(from A summer holiday in Scotland)

Beatrice Webb (1858 – 1943). Fabian Socialist, social reformer, writer, historian, diarist. Wife, collaborator and assistant of Sidney Webb, later Lord Passfield. Together they contributed to the radical ideology first of the Liberal Party and later of the Labour Party.
(from Letters)

The role of the Reconstruction Committee involved ‘…surveying and unravelling the whole tangle of governmental activities’ introduced during World War I (1914 – 1918). It was established in early 1917 but by July 1918 had been disbanded, Webb reporting that its ‘…machinery was too rickety to survive’.
(from Webb Beatrice 1858-1943 nee Potter)

Beatrice and Sidney Webb were pioneering social economists, early members of the Fabian Society and co-founders of the London School of Economic and Political Science, and had a profound effect on English social thought and institutions. Beatrice Potter Webb was born in 1858, the eighth daughter of Richard Potter, a wealthy businessman, and Lawrencina Heyworth. Surrounded from an early age by her parents’ intellectual and worldly friends and visitors, notably the philosopher Herbert Spencer, she was largely self-educated through copious reading, and frequently a partner for her father during business trips abroad. Following a tempestuous relationship with Joseph Chamberlain, which began in 1883 and lasted several years, Beatrice took up social work in London, acting as a rent collector for the Charity Organisation Society, and becoming steadily disillusioned by the inability of charitable organisations to tackle the basic causes of poverty. During 1886, she participated in research for Charles Booth’s investigations into London labour conditions, eventually contributing to Volume I of Life and Labour of the People of London (1889). During this period she continued to write articles on social subjects, most of which were printed in The nineteenth century , and published The co-operative movement in Great Britain (1891). She met Sidney Webb in 1890 during research into economic conditions and labour unions. Sidney Webb was born in London in 1859. Educated in the local academy, he left school at sixteen to work as a clerk in a colonial brokers. By attending evening classes, he passed the civil service exams in 1881 and was appointed a clerk in the Inland Revenue. The following year, he took the Civil Service upper division examination and was appointed to the Colonial Office in 1883. He also began lecturing on political economy at the Working Men’s College. 
Webb was a close friend of George Bernard Shaw, who induced him to join the socialist Fabian Society in 1885, where both men became leading members: Webb was responsible for putting forward the first concise expression of Fabian convictions in Facts for Socialists (Fabian Tract 5, 1887). As a member of the Fabian executive, Webb continued to write and lecture extensively on economic and social issues, and took a leading role in Fabian policy-making…..…….[cont'd]
(from Webb, Beatrice, 1858-1943 and Webb, Sidney, 1849-1947, social reformers and historians)

If we want to create a biographical page for Beatrice Webb ideally we would have one biog that combines the best of all of the 14 available. However, apart from this being pretty much impossible, we come back to the fact that they are often appropriate to specific collection descriptions. You can see a good example of this above, where the text refers to the ‘Reconstruction Committee’, although the title does not, in fact, tell you that this is what the collection is about.  There are also clearly some issues with two of these titles, which are not really titles at all, but names of creators, but that’s another story…

For researchers, the prospect of trawling through 14 biog entries may not seem very enticing. We do have the option to use one as the default display and then provide links to the others, but then which to pick and why?

So that leaves us with listing all of the biogs along with the collection titles. Possibly a rather unwieldy answer, but on the other hand, it could be argued that this is an improvement on researchers having to click through 14 separate records. It does at least pull the biographical information together to some extent.

In terms of our data modelling, the great thing about Linked Data is that we can decide what to say about entities within the data. For Locah, we have linked bioghist to the agent – so in this case the agent is Beatrice Webb (or Beatrice and Sidney Webb) – and we have also linked it to the ‘Archival Resource’ (the collection itself). We could decide to say that a bioghist is about someone strictly in the context of one archival resource, rather than making a link directly with the agent, but this would probably complicate things too much.
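To make that modelling choice a little more concrete, here is a small Python sketch that holds the two kinds of link as plain (subject, predicate, object) tuples. The identifiers and the property names ‘isAbout’, ‘describesContextOf’ and ‘title’ are placeholders made up for this example; they are not the terms used in the Locah vocabulary. Only the two collection titles are taken from the biographical histories quoted earlier.

```python
# Placeholder identifiers and property names, for illustration only.
triples = [
    ("biog/1", "isAbout", "agent/beatrice-webb"),
    ("biog/1", "describesContextOf", "resource/letters"),
    ("biog/2", "isAbout", "agent/beatrice-webb"),
    ("biog/2", "describesContextOf", "resource/summer-holiday"),
    ("resource/letters", "title", "Letters"),
    ("resource/summer-holiday", "title", "A summer holiday in Scotland"),
]


def objects(triples, subject, predicate):
    """All objects of statements with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def biogs_for_agent(triples, agent):
    """Pair each bioghist about the agent with the title of its collection."""
    results = []
    for subject, predicate, obj in triples:
        if predicate == "isAbout" and obj == agent:
            for resource in objects(triples, subject, "describesContextOf"):
                for title in objects(triples, resource, "title"):
                    results.append((subject, title))
    return results


webb_biogs = biogs_for_agent(triples, "agent/beatrice-webb")
```

Because each bioghist points both at the agent and at the collection, one pass over the data is enough to gather every biog for a person while still being able to say which collection each one came from.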

The SNAC project in the US (Social Networks in Archival Context) is working on creating archival authority records, which is a little like our project to create biographical records, but they are using a distinctly archival standard, EAC-CPF, and not incorporating external data within the records (though it may be referenced on their interface). Most of the people on their prototype have only created one collection, which makes life easier, but looking at the entry for Ella Fitzgerald, there are two collections. You can see that both biogs are displayed, and the source for each is given. It is interesting to note with this display how the source is given less prominence, being given in smaller letters at the end of the text. Another example, Royal Chicano Air Force, provides two biogs, but they are both the same apart from a small addition to one, even though the collections are held in different institutions.

I should emphasise that the SNAC interface is a prototype, and I know they will be doing more work on the display, so I’m not out to be critical (I think it’s a great initiative). But I do wonder whether it is a good idea to display all the biog entries one after the other with not much emphasis on where they come from and hence why there are several of them, maybe with substantial repetition. If they had an entry for Beatrice Webb with our 14 collection descriptions the biog entries would create one very very long page.

I think that we may look at including all of the biog entries, clearly linking them to the collection titles, but possibly only displaying a limited number of words for each, with the option to go to the full entry. That way we can include all of them, give a sense of what each of them provides, and let the user decide where to go from there.
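A display along those lines could be sketched as follows, in Python. Nothing has been decided about the real interface, so the function names, the 12-word limit in the usage line and the ‘title: preview’ layout are all assumptions made purely for illustration; the sample text is one of the Beatrice Webb biogs quoted above.

```python
def preview(text, max_words=30, ellipsis="…"):
    """Return the first max_words words of text, marking any truncation."""
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + ellipsis


def biog_listing(entries, max_words=30):
    """Build one 'collection title: shortened biog' line per entry."""
    return [f"{title}: {preview(biog, max_words)}" for title, biog in entries]


entries = [
    (
        "Letters",
        "Beatrice Webb (1858 – 1943). Fabian Socialist, social reformer, "
        "writer, historian, diarist. Wife, collaborator and assistant of "
        "Sidney Webb, later Lord Passfield.",
    ),
]

lines = biog_listing(entries, max_words=12)
```

Cutting on a word count is the simplest option. Cutting at the end of a sentence might read better, but the biogs vary so much in how they are written that it would need testing against real data first.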

Another avenue we would like to explore is extracting concepts from this data, and maybe that would be a way to start to find common concepts within a number of biogs. But we’ll have to see how far we manage to get with that particular challenge.
