Unique Identifiers for Archives in a Linked Data World

Our Linked Data work has thrown up a significant number of challenges around the consistency and structure of the source data from the Archives Hub, and nowhere more so than around identifiers for the archival resources, that is, the references used for the archives at all levels of description, be it collection, series, file or item.

Identifiers on the Hub

Identifiers serve two distinct purposes on the Archives Hub:

(i) the identifier for the archive itself – the reference for the actual collection or sub-collection. This is contained within the ‘unitid’ tag.

(ii) the identifier for the description of that archive – the finding aid. This is contained within the ‘eadid’ tag.

a) Identifiers for the Description of the Archive

The eadid tag consistently contains attributes for the country and the agency that maintains the description. This information is also given within the content of the tag. The Hub URI is created by converting the reference to lower case, removing the spaces, and converting slashes to dashes:

<eadid countrycode="GB" mainagencycode="1234" identifier="JaB/A">GB 1234 JaB/A</eadid>

becomes

http://archiveshub.ac.uk/data/gb1234jab-a
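As a rough illustration of that conversion, here is a minimal Python sketch. The function names are made up for this post and this is not the Hub's actual code; dropping the spaces is something we infer from the example above rather than a documented rule.

```python
# Base taken from the example URI above.
HUB_BASE = "http://archiveshub.ac.uk/data/"


def hub_identifier(reference: str) -> str:
    """Turn a reference such as 'GB 1234 JaB/A' into a Hub-style identifier."""
    # Remove all whitespace, then lower-case the result.
    compact = "".join(reference.split()).lower()
    # Slashes are not wanted in the URI path, so they become dashes.
    return compact.replace("/", "-")


def hub_uri(reference: str) -> str:
    """Prefix the normalised identifier with the Hub base URI."""
    return HUB_BASE + hub_identifier(reference)


print(hub_uri("GB 1234 JaB/A"))  # http://archiveshub.ac.uk/data/gb1234jab-a
```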

b) Identifiers for the Archive

We display the identifier for the archive within the description. So, to a degree, the way this identifier is structured in the data is less important, as long as we display it to the researcher.

The identifier for the archive is typically the same as the identifier for the description, including a code for the country and a code for the repository as well as the local identifier for the archive, although they serve different purposes:

Reference: GB 1234 JaB/A

At the top level, things can seem relatively straightforward. But bear in mind that on the Hub the primary role of the ‘unitid’ reference is to be a visual indicator of the reference – the important thing is what displays to the end-user, so a level of inconsistency in the make-up of the unitid might not be a problem as long as we display the correct reference.

If you look behind the scenes, there is a lack of consistency in the structure of these  identifiers. The country code and repository code may exist within the content (which is displayed), or they may exist as ‘attributes’ – which provide additional information that is not part of the content (and which can be displayed, but may not be), or they may exist as both. Occasionally they are not present at all.

For those that are familiar with XML markup, I mean that we could display a reference such as ‘GB 0982 UWA’, but there are various ways the data may be structured:

(1) <unitid countrycode="gb" repositorycode="0982" identifier="UWA">GB 0982 UWA</unitid>

(2) <unitid>GB 0982 UWA</unitid>

(3) <unitid countrycode="gb" repositorycode="0982">UWA</unitid>

(4) <unitid>UWA</unitid>

Even if you are not familiar with XML, you can see the way that the content is the same (apart from the last example) but the way it is structured differs. However, as long as we can display ‘GB 0982 UWA’ on the Archives Hub we are OK with this. We have ensured that our stylesheet copes with a number of different options, bearing in mind this is just for what displays through a Web browser.
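To give a feel for what coping with these options involves, here is a small Python sketch that builds a display reference from any of the four structures. It is only an illustration under our own assumptions, not our actual stylesheet: the function name is invented, and for case (4) it can only return the local reference, because the codes are simply not in the unitid.

```python
import xml.etree.ElementTree as ET


def display_reference(unitid_xml: str) -> str:
    """Build a display reference from a unitid element, however it is structured."""
    el = ET.fromstring(unitid_xml)
    # Collapse any stray whitespace in the element content.
    content = " ".join((el.text or "").split())
    country = el.get("countrycode", "").upper()
    repository = el.get("repositorycode", "")
    prefix = " ".join(part for part in (country, repository) if part)
    # If there are no attributes, or the codes are already in the content,
    # show the content as it stands.
    if not prefix or content.upper().startswith(prefix.upper()):
        return content
    # Otherwise put the codes from the attributes in front of the content.
    return f"{prefix} {content}"


examples = [
    '<unitid countrycode="gb" repositorycode="0982" identifier="UWA">GB 0982 UWA</unitid>',
    '<unitid>GB 0982 UWA</unitid>',
    '<unitid countrycode="gb" repositorycode="0982">UWA</unitid>',
    '<unitid>UWA</unitid>',
]
for xml in examples:
    print(display_reference(xml))
```

The first three examples come out as ‘GB 0982 UWA’; the fourth comes out as ‘UWA’, and the codes would have to be found somewhere else in the description.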

c) Identifiers for the Archive at Lower Levels

On the Hub,  lower levels are assigned persistent identifiers in a similar way to collections. A component’s identifier is that of its parent record (i.e. the content of the eadid tag), followed by a hyphen, then the unitid of the component.

There is no lower-level eadid – the only identifier at lower-levels is the unitid. So, we use the collection-level identifier for the description along with the lower-level identifier for the archive as the unique identifier for the lower level description (eadid + lower-level unitid):

<eadid countrycode="GB" mainagencycode="1234" identifier="JaB/A">GB 1234 JaB/A</eadid>

and lower-level

<unitid>GB 1234 JaB/A/3/1</unitid>

Would have the URI of:

http://archiveshub.ac.uk/data/gb1234jab-a-gb1234jab-a-3-1
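The same rule can be sketched in a few lines of Python. Again, the function names are invented for illustration and this is not the Hub's own code; it simply normalises both references in the same way and joins them with a hyphen.

```python
def normalise(reference: str) -> str:
    """Lower-case, strip whitespace and turn slashes into dashes."""
    return "".join(reference.split()).lower().replace("/", "-")


def component_uri(eadid: str, unitid: str) -> str:
    """Identifier for a lower-level description: parent eadid, hyphen, component unitid."""
    return "http://archiveshub.ac.uk/data/" + normalise(eadid) + "-" + normalise(unitid)


print(component_uri("GB 1234 JaB/A", "GB 1234 JaB/A/3/1"))
# http://archiveshub.ac.uk/data/gb1234jab-a-gb1234jab-a-3-1
```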

Maintaining the Distinction Between Identifiers

The Hub has, in a sense, tended to conflate the identifier for the archive with the identifier for the finding aid that describes the archive. Not in the sense of what their function is within the Hub, but more in the way that we took the decision to recommend that these two identifiers should be the same, which makes good sense most of the time. This means that it can be harder to convey the different purposes of the two identifiers. For most of the time this is not a problem, but I think many archivists do think that they are creating one identifier, and don’t think about the distinction between identifying an archive and the description of an archive.

Creating Linked Data Identifiers

Within our Linked Data the identifiers for the descriptions are the Archives Hub URIs, so that we link back into the Hub from the Linked Data. The challenges are around the URIs for the archive collection or sub-collection.

Initially we used the eadid content in the URI for the archive collection identifier, so for example:

eadid reference = GB 1234 JaB
URI = http://data.archiveshub.ac.uk/id/archivalresource/gb1234jab

The ‘JaB’ may also be the identifier for the archive, but in this case it is being used as a unique identifier for the description.

We used the eadid because it is definitely unique: the Hub requires all eadids to be unique. However, there are two main issues with this:

1) The eadid is the identifier for the description, not the archive collection.

2) Sometimes the agency that maintains the description is not the same as the agency that holds the archive. This is reflected in a different code used in the eadid (to reflect the agency that created the description) and the unitid (to show the repository that holds the archive).

For example:
eadid = GB 133 PPL – a description maintained by ‘133’, which is the John Rylands Library
unitid = GB 135 PPL – an archive stored at ‘135’, The Methodist Archive (part of the Library, but a separate entity)

In this case, we have to maintain the difference between the eadid and unitid because they are telling us different things.

It was for these reasons that we felt we should create the URI for the archive from the identifier for the archive, which is the content of the ‘unitid’ tag.

URIs for the Archive at Collection Level

At the top level, things can seem relatively straightforward. Examples of unitid:

GB 1086 Skinner
GB 0532 cwlmga

These are neat examples, and we can translate these into nice Linked Data URIs for the archival resource:

http://data.archiveshub.ac.uk/doc/archivalresource/gb1086skinner

http://data.archiveshub.ac.uk/doc/archivalresource/gb0532cwlmga

URIs for the Archive at Lower Levels

For lower-levels the unitid entries can be quite complicated, although they should work fine if the country code and repository code are included in some way:

GB 2008 TAS1/1/1
could be:

http://data.archiveshub.ac.uk/doc/archivalresource/gb2008tas1-1-1

But on the Hub, as I have said, we combine the eadid for the top level with this lower level  unitid, for reasons to do with trying to ensure the reference is unique and ensuring that the country code and repository code are incorporated, so

eadid: GB 2008 TaS
unitid: GB 2008 TAS1/1/1
would be:

http://data.archiveshub.ac.uk/doc/archivalresource/gb2008tas-gb2008tas1-1-1

or
eadid: GB 2008 TaS
unitid: TAS1/1/1
would be:

http://data.archiveshub.ac.uk/doc/archivalresource/gb2008tas-tas1-1-1

Problems with Using the Unitid for the Identifier

1) Attributes for the Country and Repository

The fact that the unitid doesn’t always contain the country code and repository code, or they may be present as attributes or as content, or both, is problematic for Linked Data, where the identifier for the archive needs a unique URI and the attributes help to provide this.

2)  Maintaining the Distinction between Identifiers

As set out above, the Hub has recommended that contributors use the same identifier for the finding aid as for the archive it describes (unless the agency is different). The different functions of these two identifiers are still preserved within the Hub, and using the same data for both works perfectly well.

However, we should not actually assume that the eadid’s mainagency code and the unitid’s repository code are the same. This is because the code for the eadid agency refers to the agency responsible for the description, and the code for the unitid refers to the repository responsible for the archive. They are usually the same, but they are not necessarily the same. If we want to make statements about our content, such as ‘this archive is held at this repository’ and ‘this person was responsible for creating this description’ then the distinction becomes important as they may be different agencies.

The agencies can also change over time. Take the example of the Papers of Dr Thomas Coke. These papers could be held at the Methodist Archive (code 135) and described by a minimal EAD document maintained by the Methodist Archive, so eadid/mainagencycode = unitid/repositorycode = 135.

But then at some point maybe John Rylands University Archive revises the description and extends it (maybe the first one was just collection-level and now it is multi-level). So eadid/mainagencycode = 133 and unitid/repositorycode = 135.

The archival stuff hasn’t changed, and it is still held/curated by Methodist Archive, so the URIs of the archival stuff shouldn’t change, even where the description is attributed to a different repository. This means that we shouldn’t rely on eadid content/attributes in constructing URIs for the holding repository.

This whole situation can be complicated by the fact that sometimes the unitid does not contain the repository code at all, so the only code we have is from the eadid, and we have to assume they are the same agency.
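The logic described in this section boils down to a simple preference, sketched here in Python. The function name is made up, and the fallback is exactly the assumption mentioned above (that the two agencies are the same), so it should be treated as a guess rather than a fact about the data.

```python
from typing import Optional


def holding_repository_code(unitid_repository_code: Optional[str],
                            eadid_agency_code: Optional[str]) -> Optional[str]:
    """Pick the code to use for the repository that holds the archive."""
    # Prefer the code given on the unitid, because it is about the archive itself.
    if unitid_repository_code:
        return unitid_repository_code
    # Fall back to the eadid agency code, ASSUMING the describing agency
    # and the holding repository are the same.
    return eadid_agency_code


# Papers of Dr Thomas Coke after the description has been revised:
print(holding_repository_code("135", "133"))  # 135
# A unitid with no repository code at all:
print(holding_repository_code(None, "133"))   # 133
```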

3) Inconsistencies in the Data

As stated above, the unitid does not always contain attributes, and so entries vary quite widely. This is mainly as a result of data coming into the Hub from many different sources over a period of time. Many descriptions were created in other systems, and it is always a challenge to move data between systems and end up with something consistent and fit for purpose. Many descriptions were created in something like Word originally, and so issues such as unique identifiers for URIs were not in the game plan at the time. In general, the eadid entries are more consistent and easier to work with than the unitid entries for multi-level descriptions.

4) Unitid may not be Unique

We have hit problems with the unitid not being unique throughout the Hub, mainly for lower-level descriptions, and this is the most significant problem. For the Hub, the only identifier that has to be unique is the identifier for the description, the eadid, because this is what the Hub works with – the Hub essentially works with the description.

Process to Identify Duplicate Unitids

Pete created an analysis of unitid content and attributes using XSLT, a nifty piece of work that allowed me to see exactly where the duplicate identifiers are. We found that in general duplicates apply to both the raw EAD and the identifiers created by the Archives Hub. But the transformation process, whereby the Hub converts to lower case and uses ‘-’ instead of ‘/’,  can create a duplicate where it did not exist in the raw EAD, such as for StoneR and Stoner (both become ‘stoner’), or for MS/1 and MS-1 (both become ms-1).
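Pete's analysis was done in XSLT, but the idea of spotting clashes introduced by the transformation can be sketched in a few lines of Python. This is an illustration only, with invented function names: it groups raw references by their transformed form and reports any group with more than one distinct raw reference. Exact duplicates in the raw EAD would need a separate count.

```python
from collections import defaultdict


def normalise(reference: str) -> str:
    """Apply the Hub-style transformation: lower case, no spaces, '/' becomes '-'."""
    return "".join(reference.split()).lower().replace("/", "-")


def find_clashes(references):
    """Map each transformed identifier to the distinct raw references that produce it."""
    groups = defaultdict(set)
    for ref in references:
        groups[normalise(ref)].add(ref)
    # Keep only identifiers produced by more than one distinct raw reference.
    return {key: sorted(refs) for key, refs in groups.items() if len(refs) > 1}


print(find_clashes(["StoneR", "Stoner", "MS/1", "MS-1", "UWA"]))
# {'stoner': ['StoneR', 'Stoner'], 'ms-1': ['MS-1', 'MS/1']}
```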

With this spreadsheet I was able to order the entries by ‘clashes’ within the identifiers and then decide how to tackle them.

Case Distinction

The ‘Stoner/StoneR’ problem is tricky. One way to get round it would be to go back to URIs that are case sensitive, but this does seem like a retrograde step. Another option would be to work with the contributors to avoid using case as a means to distinguish references, and I think this is what we will do. I think they will be happy to work with us on this, so that we can stick to lower-case URIs but avoid duplication.

Distinguishing Repository Code and Reference

There is an issue with identifiers where the local reference starts with a number, e.g. http://archiveshub.ac.uk/data/gb1067esw. It looks like the repository code is ‘1067’ but actually it is ‘106’ and the reference is ‘7ESW’. This could potentially be a problem if the repository codes ‘106’ and ‘1067’ were both used on the Hub.

We are starting to think about converting to URIs with hyphens to show the three parts of the reference, e.g. http://archiveshub.ac.uk/data/gb-106-7esw.
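A quick Python sketch shows why the hyphens help. It assumes the three parts are available separately (for instance from the attributes), which, as discussed above, is not always true of our data; the function names are ours and purely illustrative.

```python
def compact_identifier(country: str, repository: str, local_ref: str) -> str:
    """Current style: the three parts run together, e.g. 'gb1067esw'."""
    joined = country + repository + local_ref
    return "".join(joined.split()).lower().replace("/", "-")


def hyphenated_identifier(country: str, repository: str, local_ref: str) -> str:
    """Style under consideration: hyphens between the three parts, e.g. 'gb-106-7esw'."""
    local = "".join(local_ref.split()).lower().replace("/", "-")
    return "-".join([country.lower(), repository, local])


# Two different (hypothetical) references that collide in the compact form...
print(compact_identifier("GB", "106", "7ESW"))     # gb1067esw
print(compact_identifier("GB", "1067", "ESW"))     # gb1067esw
# ...but stay distinct once the parts are separated.
print(hyphenated_identifier("GB", "106", "7ESW"))  # gb-106-7esw
print(hyphenated_identifier("GB", "1067", "ESW"))  # gb-1067-esw
```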

Former Reference

We had to deal with filtering out ‘former reference’, which many contributors include as a second unitid entry, particularly where the descriptions come from other systems. This is usually a relatively straightforward process, although slightly complicated by the fact that in a few instances the attribute value ‘former reference’ isn’t used. E.g.:

<unitid label="former reference">http://archiveshub.ac.uk/data/gb123abc</unitid>

<unitid>Former reference: http://archiveshub.ac.uk/data/gb123abc</unitid>

The latter example can create problems for us. However, as with so many things, there is one further issue here: some contributors actually want the former reference (what they might call ‘alternative reference’) to be the current reference – in this case we are stumped! The only option would be to edit the Hub descriptions prior to creating the Linked Data.
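The filtering itself might look something like the following Python sketch. It is a simplification with invented names and made-up sample references: it drops a unitid if either its label attribute or the start of its content says ‘former reference’, and it cannot help with the case where the contributor wants the former reference to be the current one.

```python
import xml.etree.ElementTree as ET

FORMER = "former reference"


def is_former_reference(unitid) -> bool:
    """True if the label attribute or the start of the content marks a former reference."""
    label = (unitid.get("label") or "").strip().lower()
    text = (unitid.text or "").strip().lower()
    return label == FORMER or text.startswith(FORMER)


def current_unitids(did_xml: str):
    """Return the content of every unitid that is not flagged as a former reference."""
    root = ET.fromstring(did_xml)
    return [
        " ".join(el.text.split())
        for el in root.iter("unitid")
        if el.text and not is_former_reference(el)
    ]


# Sample values are invented for illustration.
sample = (
    "<did>"
    "<unitid>GB 123 ABC</unitid>"
    '<unitid label="former reference">XYZ/1</unitid>'
    "<unitid>Former reference: XYZ/2</unitid>"
    "</did>"
)
print(current_unitids(sample))  # ['GB 123 ABC']
```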

Changes to References

In the end, even once we face all of the other issues with identifiers, we also have the risk that an archive repository will choose to change its reference. This is not common, but it does happen. We could probably find a way to ‘archive’ the old reference and have some kind of link between the two. The horror scenario would be if the repository then used the old reference, which to them may now be redundant, for a new accession.

Reasons for Duplication

The unitid is usually duplicated due to human error, rather than as part of the system of cataloguing.  It isn’t surprising when you consider the level of detail and length of some descriptions, and that they are created with various software applications, some of which don’t really help with creating good identifiers. You find entries with references such as:  JAB/1/2(i)/2/2, JAB/1/2(i)/2/3, JAB/1/2(i)/3, JAB/1/2(i)/5. Once you identify where the mistake is, you can simply correct it.

In one interesting example for a large collection, the duplicate identifier was purposely used, because the higher level entry described a person and the entry below that described the ‘stuff’ about them. So, something like ABC/JB for an entry about John Bunn in the ABC collection, and ABC/JB for a further entry for a file of stuff about John Bunn. You can see how in the display this works OK, although it means each entry has the same URI, but in terms of Linked Data it is a problem.

Conclusion

Our plan at present is, as far as possible, to correct the data, rather than try to work around it. This means some ‘persistent’ URIs will change, but it seems worth the end result of ensuring unique identifiers. We will have to run the analysis on all of our data in order to pick up any issues that need addressing. This has the advantage that we can also pick up other problems with the identifiers, such as the odd incorrect repository code, or attributes absent from eadid entries.

It has been crucial for Pete, who has done the processing, to really understand how URIs are constructed, and in doing this it has become apparent how our system has been skilfully developed by John Harrison, our developer, to cope with many variations, as you have to with a large aggregator.  It would have been extremely difficult to ensure a level of rigour in the data initially, when we were taking in such a diversity of content and focussing on building up the service. In addition, we could not plan for ‘cool URIs’ or, of course, Linked Data.

For Linked Data, rigour is important because you are drawing out so many different entities within the world you are describing, and they all need unique and dereferenceable http URIs. The question is, do you put the effort into introducing more rigour into your source data – is it worth the investment? It is certainly often time-consuming and can be quite a difficult process with so many variables to think about. I think that it generally is worth doing, because more consistent data with properly constructed identifiers must be a good thing, not just for Linked Data but for the whole open and interoperable agenda.


Archives & Linked Data Meetup

With various projects now under way around Linked Data in archives, I thought the time seemed right to get together and share what we have done, what we think and what we see as the challenges around this work. The Locah project team were particularly keen to find out what the AIM25 team had been up to with their Open Metadata Pathway and Step Change projects, but it seemed well worth broadening this out, so I put out an invitation to other projects and through the archives email list. In the end we had several projects represented:

Locah / Linking Lives: a project to create Linked Data for EAD archive descriptions on the Archives Hub. We have provided a stylesheet that converts EAD to RDF/XML as well as a SPARQL endpoint and Linked Data views of our data. We did a significant amount of work around data modelling, the use of RDF vocabularies and creating external links, and we blogged in detail about the processes and issues involved in creating Linked Data from complex hierarchical archive descriptions. We are now working on an interface to show how the Linked Data can be used to bring different resources together.

Open Metadata Pathway / Step Change: Work around the use of OpenCalais and the UKAT thesaurus (subjects and names) to extract entities from data and enable URIs to be created. A tool is being developed to allow archivists to do this at the time of cataloguing and the project is working with Axiell CALM to embed this into the CALM software and display via CALMView.

SALDA: A project to output the Mass Observation archive as Linked Data and enhance the data. This built on the Locah work.

Bricolage: will publish catalogue metadata as Linked Open Data for two of its most significant collections: the Penguin Archive, a comprehensive collection of the publisher’s papers and books; and the Geology Museum. It intends to build on previous work done by Locah and SALDA.

Trenches to Triples: will provide Linked Data markup to both collection level descriptions and item level catalogue entries relating to the First World War from the Liddell Hart Centre for Military Archives. It will also provide a demonstrator for using Linked Data to make appropriate connections between image databases, Serving Soldier, and detailed catalogues.

The majority of those attending represented these projects. There was also representation from the DEEP project to digitise English place names and make them available as structured data. Other attendees represented the museum sector and The National Archives. In addition, we brought developers together with archivists and managers, and I think we managed to strike a good balance in our discussions so that they were of benefit to everyone.

In the morning we shared information about our projects. This gave us a chance to ask questions and get a clearer understanding of what people have been doing in this space. A number of issues were presented and discussed.

Extracting concepts from data

We were given a demonstration of the prototype cataloguing interface developed by the OMP project and now being developed under Step Change. It uses OpenCalais to extract concepts from archive descriptions, which tend to be quite document-centric, and contain large chunks of text, particularly in the biographical history and the scope and content sections. The idea is to provide URIs for these concepts, so, for example, OpenCalais highlights a name within a paragraph of text, such as ‘Architectural Association’, and you can then confirm that this is an organisation and it is relevant as an index term so that it is marked up appropriately.  The tool is being tested with archivists because ease of use is going to be key to its success. We discussed the limitations of the OpenCalais vocabulary – it does really clever data analysis, but it isn’t geared up for historical data sources. UKAT is much broader and more suitable for archive descriptions – it would be good to integrate this vocabulary into OpenCalais.

Solving the challenges of archive descriptions

We discussed some of the challenges that Locah has faced with processing multi-level archive descriptions, challenges such as: duplicate identifiers for different resources (especially within the same description – more to come on this issue on the Linking Lives blog); creating URIs for data such as extent of archive (where you may have ‘10 boxes’, but you may also have ‘2 boxes, 10 large photographs and a reel of film’); inheritance of data from the top level down through the levels of a description (which is problematic within Linked Data) and matching names on the Archives Hub to names on VIAF (which we’ve had reasonable success in doing, though in archives names can be quite problematic, such as ‘John Davis, fl. 1880-1890’).

Working with one dataset versus working with a large aggregation

We thought about the comparison between creating a Linked Data output for the Archives Hub, which aggregates data from hundreds of archives, and creating it for just the Mass Observation Archive. Whilst the scale you are working with is appealing with a large aggregator (potential to create Linked Data for all these repositories), working with a discrete collection gives you more control to be able to interpret things within the data (for example, the date may always appear in a certain place, so you can confidently mark up the date as a date and give it a URI).

This led us into some discussion around the way that creating Linked Data can really highlight problems within the data source, and it may provide impetus to address these problems, thus improving the source data by making it more consistent or clarifying the meaning of elements within the data.

Integrating Linked Data into the Workflow

The Step Change project is particularly focussed on the challenge of making this kind of semantic markup easy to achieve. It has to be well-received by cataloguers. Work is being undertaken at Cumbria Record Office to test the tool out and provide feedback. We discussed the importance of major players such as Axiell CALM embracing this kind of approach, enabling Linked Data to be created from CALM descriptions. This is not yet happening, but the Step Change project is working with CALM and so it is a good starting point. We also discussed the need for the CALM user group to think about whether they want this kind of Linked Data output from their software provider (it needs to be demand-led).

The ‘Same As’ Issue

We touched on the issues around trusted data a number of times. The SALDA project found that creating ‘same-as’ links was probably the most challenging part of the project. We agreed that we must be aware of the importance of archive descriptions being trusted sources and there has been a tendency for some data providers to use ‘same-as’ links too promiscuously. In a Linked Data context this is problematic, as you are then asserting that all of the statements made by the data source you are linking to through this relationship are true.  It raises the issue of manual matching as a means to be sure your links are semantically correct, but doing this is time-consuming, so it can only be carried out in a minority of cases.

* * *

In the afternoon we had two sessions, (i) techniques and tools, and (ii) opportunities and barriers. A brief summary of some of the points that were made during the discussions:

Benefits of Linked Data

  • The principle of generic APIs – a standard way to provide access to data – it could replace the myriad of bespoke APIs now available.
  • Dataset integration – bringing data sources together.
  • The precision of information retrieval and giving researchers more potential to ask specific questions of data sources. For example, a researcher may be able to retrieve information around a specific event.
  • Embedding the expertise of practitioners was seen as something that should be a benefit, i.e. we should ensure that this happens.
  • It encourages cross-domain working.
  • It enables people to create their own interfaces and tools to utilise data sources.
  • It encourages the creation of narratives through diverse data sources.
  • It is very much an ‘anti-silo’ approach to data.

Challenges

  • Expertise required.
  • Need to clearly show the extra value it brings (e.g. above what Google offers).
  • Need clearer understanding of end-user benefits.
  • Sustainability and persistence (we talked about the idea of a ‘URI checker’ and using caching to help with this).
  • Possible overload caused by large-scale searches or ‘bad’ SPARQL queries.
  • Licensing, including restrictions on external data that you might want to link to within your data.
  • Choice of so many vocabularies.
  • Likelihood of not following the same kinds of practice, thus impeding the linking possibilities between datasets.

Conclusions

The group felt that a recommended approach for archive descriptions would be really useful, to facilitate others outputting Linked Data and ensure we do get the benefits that Linked Data potentially offers.

We talked about a generic stylesheet. The community has already benefitted from the data model and stylesheet developed by the Locah project, with AIM25’s OMP project and SALDA both using it for their projects, and Bricolage looking at it for their Linked Data work. But there are still issues around the diverse nature of the data, so a stylesheet to transform EAD descriptions to RDF/XML may be a great start for many projects, but modifications are almost inevitable, and expertise would be required for this.

We did decide that a possible way forwards would be what one attendee called a ‘lego approach’, where we think in terms of building blocks towards Linked Data. The idea would be to work on discrete parts of a data model and to recommend best practice. For example, one area would be the relationship between the resource and the index terms or access points. Another would be the relationship between the resource and the holding institution.

This approach should be cross-domain, bringing archives together with museums and libraries.  We could look at parts of the model in turn and decide what the relationships are, whether they are consistent across the domains and which vocabularies would be appropriate. The idea would be to end up with a number of ‘RDF fragments’ that people could use, but with the flexibility to extend the model to meet their requirements.

We are hoping to discuss this proposal more and think about what would be required in order to achieve this kind of co-ordinated approach. Obviously we would immediately need to get buy-in across the three domains. Our meeting was representing archives, but this approach would require a very collaborative effort. However, the advantages are very clear, and it does seem to achieve a balance between the challenges of a completely interoperable solution versus the disadvantages of each domain working out different models and using different vocabularies.

 

 


Designing an Interface: some first thoughts

One of the aims of the Linking Lives project is to demonstrate the value of Linked Data through the creation of an end-user interface that pulls in content from the Hub Linked Data, including the external data sets we are linking to. The Linking Lives interface will be part of the Archives Hub service, that is to say, available from within the Hub website. We will present it as a beta service; something that is usable and useful, but also in a state of development. With the provision of this interface, we can start to build up an understanding of how valuable this type of name-based resource is for researchers. We will be able to monitor use as well as carrying out an evaluation to ask researchers what they think of the site. This is far preferable to positing benefits based upon potential, which is tending to happen too much with Linked Data at present.

This post is written from a non-technical perspective and covers a few of the areas that we are currently thinking about, as we start to set out our interface design.

Priorities

We will be concentrating on development of the interface, rather than prioritising scale for this project: quality rather than quantity you might say, although we expect to have some thousands of records included. This is partly pragmatic, because we are still finding challenges over integrating EAD data (Archives Hub descriptions) into our Linked Data because of inconsistencies and sometimes problematic content. The problems that we face with variable data are ongoing, and maybe highlight a basic issue with Linked Data: it works best with consistent data-centric information, and not so well with archival descriptions, built up over decades, many created before there were any standards at all to adhere to. However, on the positive side, our Linked Data work has enabled us to highlight and deal with many data issues, which is beneficial in the long run for any data processing that we might do (or that others might do).

Our focus for this project is on the Linking Lives pages themselves, and what researchers can access from there, so we will not be prioritising the creation of different search options into the data: this would be a next stage, once we get a clearer idea of the use of the interface.

Archives Hub Branding and Navigation

We want Linking Lives (LL) to be recognisably part of the Hub, although it would be premature to try to fully integrate the two. As yet, we don’t know how users  will respond to what we are proposing, and we need to evaluate what we are doing before taking it further into service. We are carrying out an evaluation as part of the project: we will be asking a small group of researchers questions about the current Hub interface, and following this up with some focus group work to get reactions to our new LL interface. This will help us in understanding user requirements.

Linking Lives will be an interface available within the Archives Hub site, but we propose to incorporate data other than archival descriptions within the page. This does raise questions about the clarity of what we are doing and the balance between the different data sources. If we strongly brand the page as Archives Hub, will researchers expect to access just archives, and not other information resources? Will they assume all of the sources are held by us, or that we are responsible for them? If we include the basic Hub navigation at the top of the page, will that actually confuse users, as they may click on links that take them into the main Hub search without realising that LL and the Hub are somewhat different?

We are looking at creating a sub-brand of the Hub as a possible way to identify LL as part of the Hub, but still distinct from it to some extent. This may help to distinguish between the two different applications. We will use the basic Hub logo, but modify it to signify something different. We do want to keep the links between the two, as we believe that researchers will benefit from this, and we do want to bring archives and other data sources together to provide a fuller context, and not make them too distinctly separate. The idea is to enable researchers to move seamlessly from archives described within the Hub to other resources, and take a fairly bold approach to integration, otherwise we will not get the benefits we are after. I am somewhat reminded of The National Archives’ initiative called ‘Your Archives’, which is a Wiki for community content that does seem to have remained rather separate from the main TNA catalogues, and maybe that has been to its detriment in terms of profile and use (I often have trouble finding links to Your Archives from within TNA’s website).

Broad Appeal

The LL interface, like the Hub itself, will not be aimed at subject specialists or expert users. It will primarily be aimed at academic researchers, but is intended to appeal to a broad audience: anyone who might be interested in undertaking research. This means that we need to avoid making assumptions about knowledge. Our ‘designated community’ may not have prior knowledge of archives and certainly won’t have knowledge of Linked Data. So they may not know how archives are organised, what an archival ‘biographical history’ is, what an archival creator is, or what ‘same as’ links are between different data sources.

Our aim, therefore, is to incorporate these things in a way that makes sense and makes the person the primary focus of the page, so that it is easy to see that a page is about George Bernard Shaw, for example, and it provides life dates, descriptive information, biographical information, an image or two, aliases for the same person, etc. It is information you might expect to find, or information that makes sense within the context of a page about a person.  At the same time we are keen to ensure that we capture provenance, and so this adds another dimension. Starting to include the source of each piece of information could clutter the screen and so we will need to think about how best to incorporate it. We believe that it will be important to some users, as it could have implications for the quality and accuracy of the data. It is something we would be pleased to see others do for our data, if they were presenting it in a Web interface.

The BBC Example

Our interface will combine content from different sources. We would like to draw in content in a similar way to the BBC (on the BBC page for Stevie Wonder you can see how the Wikipedia biography is pulled into the page). The BBC page pulls in some of the Wikipedia biog, and provides a link to go to Wikipedia and read more. This helps to make clear that the information comes from elsewhere. With MusicBrainz, another Linked Data source, the BBC provide a link to the MusicBrainz site, but also, further down their page, they state: “Links & information come from MusicBrainz. You can add or edit information about Stevie Wonder at musicbrainz.org.” The information includes personal and business relationships, such as ‘child of’ and ‘collaborated on’.

On the BBC page, the Wikipedia information is more clearly labelled as being from that source; the MusicBrainz information is also identified, but in a less obvious way. With MusicBrainz, however, they are not only declaring where the information comes from, they also invite people to edit the information themselves.

LL will be a useful resource in itself, but can also be a starting point, in much the same way as the BBC provides a page that gives substantial information on a musician or an animal they are interested in, but also invites people to move away from the site to other resources. This in itself is an interesting shift of focus. Long gone are the days when some sites actually disabled the ‘back’ button, and now we are moving towards an even more fluid world, if this type of approach continues to gain traction, where we are not always trying to keep people on our pages, but are actually encouraging them to move around the ‘Web of Data’.

Focus on Expectations

Looking at the BBC page on Stevie Wonder again, one thing that I notice is that it is quite busy. There is a good deal of information, with various boxes and loads of links and options for the user. There does seem to be a trend towards busier pages now, maybe an indication that people are increasingly adept at finding their way through information online, so a certain level of complexity is acceptable. Also, the page is quite long. The BBC page about mammals is similarly long and complex: introduction, links to other pages on mammals, distribution, classification, BBC news, video, information elsewhere, size ranges, the Wikipedia ‘about’ page, etc. Yet the page does not seem cluttered or difficult to navigate. This is partly because of the use of plain language, as well as BBC expertise in web design. It may also be that expectations largely match reality: users may expect the BBC to provide a wealth of information, and they generally know what they will get if they go to ‘programmes’ or ‘video’ or ‘news’ pages.

Expectations do play an important part in good Web design, and maybe it is easier if you are a very well known provider, as the expectations people have are clearer? Many people come into a page through a search engine, so you cannot expect they will have used your homepage, and picked up information via this route. However they arrive at a BBC page, most people know what the BBC is. But arriving at an Archives Hub Linking Lives page, you probably have little idea of the provider in this case, and you may not be clear about what archives are in this context.

We chose to create a biographical resource partly because this would provide a focus; we can convey the fact that the page is about one person relatively easily. This makes it easier in some ways than working with the Archives Hub itself, which doesn’t have that kind of focus. If we provide a page with a whole range of links to various types of biographical content, then we should be able to convey what the page is about fairly easily. It may be that good, clear and simple headings and relevant content (about one subject – in this case one person) are better than providing explanations about what you are and what you are trying to provide, as people don’t tend to read help pages.

A ‘Controlled’ Experience

Our interface will use the external data sources within our data, and will be designed in order to give users a controlled experience, in the sense that we are evaluating the sources we include and presenting the interface in a very defined way. Of course, we cannot control the content of the external data; I am just talking about the way we present it.

An alternative approach would be to pull in all the data that can be found on a topic and display it. Maybe this is the ideal for Linked Data – the ability to bring in any data sources on a topic – but we are quite some way, it seems, from presenting this in a way that end users will want to use. Try a search on Hakia, a semantic search engine (not directly about consuming Linked Data, but about pulling in related information in a more semantic way). I looked for Beatrice Webb, and got a substantial amount of information from a very diverse range of sources, including news, blogs, Twitter, images and video. It’s quite impressive in principle, and could be really useful for a researcher, but the net is cast very wide, so it’s not easy to process all of this varied information. Sig.ma describes itself as a semantic information mash-up. If you take a look at the page that sig.ma provides for Beatrice Webb, a substantial amount of data is pulled in, but it is not very user-friendly, not always very coherent and sometimes not relevant. Obviously it is just a demonstrator, and I would say it is for a different audience, with more expertise in Linked Data. It does show the potential for this type of approach, which draws in a really diverse range of data on the fly, but it also shows how semantic searching is complex and difficult to achieve within a user-friendly interface.

The Linking Lives Unique Selling Point

Sites like Wikipedia have biographical pages, and we can never compete with them, so what can we offer that is of value? Essentially, our focus is on meeting the needs of those who want to carry out more in-depth research and who are likely to use primary sources. It may not be people who know they want to use primary sources, it may actually be a means to bring people to archives for the first time (we know that a large proportion of Archives Hub users are first time users of the Hub, and have not necessarily used archives before). We want to make primary sources the focus, but at the same time put them within the context of a whole range of information sources about a person, so that they are not held apart as somehow different and not for mainstream researchers.

It is also worth pointing out that our interface will still in some sense be a demonstrator – it will provide one option for presenting our Linked Data, but the data is there for others to create their own interfaces, and the SPARQL endpoint is there for people to query the data in the ways they want to. In addition, we can re-expose the data that we present. So, there are several purposes here: benefiting end-users, evaluating a name-based approach and putting archives within a broader context, demonstrating the sort of interface that can be provided from Linked Data and possibly re-exposing the data to create more potential benefit.

Posted in archival context, branding, interface | 4 Comments

One Person in Context – Working with Biographical Histories

I have been starting to think about the user interface for Linking Lives. We will probably go for something quite simple in terms of layout, because there is quite a bit of complexity when bringing together a range of data sources.

It may be thought that integrating the external data sources is the challenge, but I think that it is probably more of a challenge to integrate several archival descriptions into one biographical record and also to convey the context of the archival descriptions clearly.

In this post, I am focusing on that often very useful field of information, the biographical history.  This is a field that is used to help place the archives in their context, by providing significant and relevant information about their creator(s). It is widely used in archives, although there are increasingly moves to exclude this information from the actual collection description and provide it separately. There are a few observations worth making about this field:

  • In general, it is considered good practice for the biographical history to be appropriate to the records being described. So, you don’t include a full life story when you are describing one letter relating largely to one event in a person’s life….
  • …but this guidance is not always adhered to, so some biogs are long and detailed for a small and discrete collection, while others are very brief, even though they may relate to an archive that spans the individual’s entire life.
  • Some repositories will use largely, or entirely, the same biog for different collections about one individual, others will create very distinct biogs, and some may use biogs that have been created by other institutions.
  • Some biogs will involve a significant amount of research, with the archivist drawing on the unique sources they are cataloguing to provide information that may then be unique in itself, making this field particularly useful for researchers.

I am going to use the example of Martha Beatrice Webb here, a significant figure in history, and one with plenty of archival sources that relate to her.

Photo of Martha Beatrice Webb

From the LSE collection on Flickr

On the Hub we have 14 collections where Beatrice Webb is the ‘creator’ or co-creator of the archive (for information on archival creators see a post on the Hub blog, Who is the creator?).  These collections are from three different archive repositories. Here is a selection of the biographical histories (not all yet available from our Linked Data store):

Beatrice Webb (1858-1943), nee Potter, social reformer and diarist. Married to Sidney Webb, pioneers of social science. She was involved in many spheres of political and social activity including the Labour Party, Fabianism, social observation, investigations into poverty, development of socialism, the foundation of the National Health Service and post war welfare state, the London School of Economics, and the New Statesman.
(from A summer holiday in Scotland)

Beatrice Webb (1858 – 1943). Fabian Socialist, social reformer, writer, historian, diarist. Wife, collaborator and assistant of Sidney Webb, later Lord Passfield. Together they contributed to the radical ideology first of the Liberal Party and later of the Labour Party.
(from Letters)

The role of the Reconstruction Committee involved ‘…surveying and unravelling the whole tangle of governmental activities’ introduced during World War I (1914 – 1918). It was established in early 1917 but by July 1918 had been disbanded, Webb reporting that its ‘…machinery was too rickety to survive’.
(from Webb Beatrice 1858-1943 nee Potter)

Beatrice and Sidney Webb were pioneering social economists, early members of the Fabian Society and co-founders of the London School of Economic and Political Science, and had a profound effect on English social thought and institutions. Beatrice Potter Webb was born in 1858, the eighth daughter of Richard Potter, a wealthy businessman, and Lawrencina Heyworth. Surrounded from an early age by her parents’ intellectual and worldly friends and visitors, notably the philosopher Herbert Spencer, she was largely self-educated through copious reading, and frequently a partner for her father during business trips abroad. Following a tempestuous relationship with Joseph Chamberlain, which began in 1883 and lasted several years, Beatrice took up social work in London, acting as a rent collector for the Charity Organisation Society, and becoming steadily disillusioned by the inability of charitable organisations to tackle the basic causes of poverty. During 1886, she participated in research for Charles Booth’s investigations into London labour conditions, eventually contributing to Volume I of Life and Labour of the People of London (1889). During this period she continued to write articles on social subjects, most of which were printed in The nineteenth century , and published The co-operative movement in Great Britain (1891). She met Sidney Webb in 1890 during research into economic conditions and labour unions. Sidney Webb was born in London in 1859. Educated in the local academy, he left school at sixteen to work as a clerk in a colonial brokers. By attending evening classes, he passed the civil service exams in 1881 and was appointed a clerk in the Inland Revenue. The following year, he took the Civil Service upper division examination and was appointed to the Colonial Office in 1883. He also began lecturing on political economy at the Working Men’s College. 
Webb was a close friend of George Bernard Shaw, who induced him to join the socialist Fabian Society in 1885, where both men became leading members: Webb was responsible for putting forward the first concise expression of Fabian convictions in Facts for Socialists (Fabian Tract 5, 1887). As a member of the Fabian executive, Webb continued to write and lecture extensively on economic and social issues, and took a leading role in Fabian policy-making…..…….[cont'd]
(from Webb, Beatrice, 1858-1943 and Webb, Sidney, 1849-1947, social reformers and historians)

If we want to create a biographical page for Beatrice Webb, ideally we would have one biog that combines the best of all of the 14 available. However, apart from this being pretty much impossible, we come back to the fact that they are often appropriate to specific collection descriptions. You can see a good example of this above, where the text refers to the ‘Reconstruction Committee’, although the title does not, in fact, tell you that this is what the collection is about. There are also clearly some issues with two of these titles, which are not really titles at all, but names of creators, but that’s another story…

For researchers, the prospect of trawling through 14 biog entries may not seem very enticing. We do have the option to use one as the default display and then provide links to the others, but then which to pick and why?

So that leaves us with listing all of the biogs along with the collection titles. Possibly a rather unwieldy answer, but on the other hand, it could be argued that this is an improvement on researchers having to click through 14 separate records. It does at least pull the biographical information together to some extent.

In terms of our data modelling, the great thing about Linked Data is that we can decide what to say about entities within the data. For Locah, we have linked bioghist to the agent – so in this case the agent is Beatrice Webb (or Beatrice and Sidney Webb) – and we have also linked it to the ‘Archival Resource’ (the collection itself). We could decide to say that a bioghist is about someone strictly in the context of one archival resource, rather than making a link directly with the agent, but this would probably complicate things too much.
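To make the two links concrete, here is a minimal sketch in plain Python of a bioghist connected both to an agent and to an archival resource. The identifiers and the predicate names (`isAbout`, `describedIn`) are made up for illustration; they are not the actual Locah vocabulary or Hub URIs.

```python
from collections import defaultdict

# Hypothetical identifiers and predicate names, for illustration only.
# Each tuple is a (subject, predicate, object) statement.
triples = [
    ("bioghist/1", "isAbout", "agent/beatrice-webb"),
    ("bioghist/1", "describedIn", "archivalresource/letters"),
    ("bioghist/2", "isAbout", "agent/beatrice-webb"),
    ("bioghist/2", "describedIn", "archivalresource/summer-holiday"),
]


def bioghists_for_agent(triples, agent):
    """Return {bioghist: [archival resources]} for every bioghist about the agent."""
    # First find the bioghists linked directly to the agent...
    about = {s for s, p, o in triples if p == "isAbout" and o == agent}
    # ...then collect the archival resource(s) each of those bioghists belongs to.
    resources = defaultdict(list)
    for s, p, o in triples:
        if p == "describedIn" and s in about:
            resources[s].append(o)
    return dict(resources)


result = bioghists_for_agent(triples, "agent/beatrice-webb")
```

Because each bioghist keeps its link to a collection, a page about the agent can list every biog alongside the collection it was written for.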

The SNAC project in the US (Social Networks in Archival Context) is working on creating archival authority records, which is a little like our project to create biographical records, but they are using a distinctly archival standard, EAC-CPF, and not incorporating external data within the records (though it may be referenced on their interface). Most of the people on their prototype have only created one collection, which makes life easier, but looking at the entry for Ella Fitzgerald, there are two collections. You can see that both biogs are displayed, and the source for each is given. It is interesting to note with this display how the source is given less prominence, being given in smaller letters at the end of the text. Another example, Royal Chicano Air Force, provides two biogs, but they are both the same apart from a small addition to one, even though the collections are held in different institutions.

I should emphasise that the SNAC interface is a prototype, and I know they will be doing more work on the display, so I’m not out to be critical (I think it’s a great initiative). But I do wonder whether it is a good idea to display all the biog entries one after the other with not much emphasis on where they come from and hence why there are several of them, maybe with substantial repetition. If they had an entry for Beatrice Webb with our 14 collection descriptions, the biog entries would create one very, very long page.

I think that we may look at including all of the biog entries, clearly linking them to the collection titles, but possibly only displaying a limited number of words for each, with the option to go to the full entry. That way we can include all of them, give a sense of what each of them provides, and let the user decide where to go from there.
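As a rough sketch of the "limited number of words" idea, the function below cuts a biog down to a preview. The function name and the default of 30 words are assumptions for illustration; we have not settled on a length.

```python
def preview(text, max_words=30):
    """Return the first max_words words of a biog, marking any truncation."""
    words = text.split()
    if len(words) <= max_words:
        # Short biogs are shown whole (with whitespace normalised).
        return " ".join(words)
    # Longer biogs are cut at a word boundary and flagged with an ellipsis.
    return " ".join(words[:max_words]) + " ..."


short = preview("Fabian Socialist, social reformer, writer", max_words=30)
cut = preview("Fabian Socialist, social reformer, writer, historian, diarist", max_words=3)
```

Each preview would sit next to its collection title, with a link through to the full entry.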

Another avenue we would like to explore is extracting concepts from this data, and maybe that would be a way to start to find common concepts within a number of biogs. But we’ll have to see how far we manage to get with that particular challenge.

Posted in archival context, biographical history, interface | 1 Comment

A Little Bit About Licensing

The Linking Lives project aims to deliver:

  • An end-user interface that provides a means to integrate archival data with other information sources
  • Blog posts that share our progress and reflect on the work
  • Reusable software outputs for manipulating RDF and formatting within Web pages
  • An evaluation report
  • Documentation setting out the data sources and relationships behind the interface

You can read more about it on our ‘About Us’ page.

First things first. We currently have a Linked Data store with a small amount of Archives Hub data. We need to expand this considerably. Our aim is to provide a substantial amount of the Hub data, preferably the entire data set, as Linked Data, and then it will be part of the Linking Lives interface.

We had already consulted with Hub contributors about our Linked Data work, but in order to really expand the data set, we need to make it very clear to them what we want to do. The Archives Hub is an aggregation of data from over 200 archives across the UK, so we are in a unique situation, and we want to work with archivists to move the community towards an open data agenda. It is vital for us to show our contributors that we are working on their behalf, and that they will be fully informed about our plans and progress.

We feel that it is important to give the data an explicit licence, preferably a completely open licence. That way we don’t put any barriers in the way of its potential reuse. I was recently at the Europeana Tech conference in Vienna, and the dominant theme of the conference was the fundamental importance of open data. One observation that struck me was the conclusion from Europeana participants that it is better to put less data out but put it out under an open licence, than put more data out but compromise with a complex and/or restrictive licence. Some of the Archives Hub contributors have been concerned about commercial exploitation. It is worth looking at Jill Cousins’ presentation on this. She argues that even a non-commercial licence means that you are substantially restricting the potential of the data: it can’t be used on any sites or cultural blogs that demonstrate any commercial activity, and it can’t be used with Wikipedia or by commercial companies that might generate income for partners.

We need to bring Hub contributors on board with this vision, and to do this we sent out an email to all contributors outlining our proposal and asking that they let us know if they do not want to participate.

In the email I did the following:

1) Set out the benefits of Linked Open Data

2) Described the Linking Lives proposal

3) Referred to the potential for us to be involved with the US-based ‘SNAC’ project. This is not a Linked Data project, but it is creating name authority files using the archival standard of EAC-CPF, and I wanted to show that we are working on different fronts with the aim of improving access to archives. I do think it’s worth giving this kind of context; showing that services like the Hub are working in different ways on behalf of archives to promote understanding and use of primary source material.

4) Referred to the options for licensing, including the possibility of an attribution licence, although ideally we would still opt for a completely open licence and strongly promote best practice around attribution (we are looking at named graphs with this in mind, as a means to ensure that the provenance of statements can be shown).

5) Emphasised that this is about the metadata, not the content. This may sound obvious, but it is an important distinction. The metadata is there to promote the collections. There are far more complex issues around open access to some collections, where there are legal issues around IPR.

6) Referred to some useful sources to read more about open data and some initiatives, such as Europeana and Discovery, that are fully behind an open data approach.
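The named-graph idea mentioned in point 4 can be sketched very roughly as follows: each statement carries a fourth element naming the graph it belongs to, and that graph stands for the contributor who supplied it. This is plain Python rather than a real triple store, and every identifier here is hypothetical.

```python
# Each statement is a quad: (subject, predicate, object, graph).
# The graph name records which contributor supplied the statement.
# All identifiers below are hypothetical.
quads = [
    ("agent/beatrice-webb", "birthYear", "1858", "graph/repository-a"),
    ("agent/beatrice-webb", "occupation", "social reformer", "graph/repository-a"),
    ("agent/beatrice-webb", "occupation", "diarist", "graph/repository-b"),
]


def sources_for(quads, subject, predicate, obj):
    """Return the sorted names of the graphs that assert the given statement."""
    return sorted(
        {g for s, p, o, g in quads if (s, p, o) == (subject, predicate, obj)}
    )


diarist_sources = sources_for(quads, "agent/beatrice-webb", "occupation", "diarist")
```

With statements grouped this way, an interface can show where each piece of information came from even if the licence itself does not require attribution.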

I think the real potential of Linked Data is still difficult for people to grasp. I pointed to things like Tim Sherratt’s recent work, creating a narrative using the Web of Data, as this is a great way to demonstrate the possible uses of this structured data, and I also referred to established and respected institutions like the BBC leading the way with using different data sources and taking the risk of incorporating Wikipedia data on their site.

So far we have had two contributors asking to opt out of the Linked Data work, one very small archive and one large HE archive. We have also had some questions about what the work involves, questions that show a certain level of concern (as you would expect), albeit with an overall positive attitude towards open data. Maybe we need more explicit help with licensing archival data. There are a number of useful sources, such as the Licensing Open Data guide (PDF) available from the Discovery website, but it would be useful to have a document that specifically refers to opening up archival metadata, and maybe more information on the issues around data aggregations.

Several contributors have written to us to show their support, including the Universities of London, Dundee and Hull. We are very pleased that two of our biggest contributors, the University of Glasgow and John Rylands Library at the University of Manchester, have shown very strong support. We’re going to be adding their data to our triple store in the near future, as they have large collection descriptions with thousands of component items, so that will be a good test for our stylesheet. Institutions like this have some great archives, and detailed descriptions that lend themselves to strong narratives, linking up people, places, events, to create a whole host of different stories.

We are still working on exactly which licence to use for the Archives Hub data, but we are certain that it will be open, as this is vital to ensuring that we can truly connect data. As Edward L Ayers wrote, back in 1999: “Might history, which exists in symbiosis with large amounts of diverse evidence, be especially well-suited for the technology evolving around us?” (from History in Hypertext). I think that the answer is ‘yes’, and I think that Linked Data promises much if it really does become embedded in the Web.

Posted in licensing, open data | 1 Comment