Do not underestimate cleaning your data!

In Linked Open Data: The Essentials (Bauer, Kaltenbock) The first steps given for publishing your content as LOD are:

1. Analyse your data

2. Clean your data

3. Model your data

…and it goes on to very helpfully summarise the further steps required. The steps given are typical of the advice often given about how to create Linked Data.

Under ‘Clean your data’ it states:

Data and information that comes from many distributed data sources and in several different formats (e.g. databases, XML, CSV, Geodata, etc.) require additional effort to ensure easy and efficient modelling. This includes ridding your data and information of any additional information that will not be included in your published data sets.

In retrospect, I greatly underestimated this particular step. Format is fine as far as we are concerned, but our data does come from many data sources – from over 200 sources in fact. I’m not sure about ridding the data of additional information, but for us issues around data consistency have created a very significant amount of extra work; work that I did not properly factor into the process.

Before I say any more about this, I want to make one thing clear: in talking about inconsistency and ‘errors’ in the data, I am not wanting to criticise the Archives Hub contributors at all. For a start, much of the data in the Hub was created over many years, and much has been migrated from many different systems. Secondly, we were simply not thinking in a Linked Data way 5 or 10 years ago. We didn’t necessarily prioritise consistency in instances where it now becomes much much more important. We didn’t ask for things that we now ask for, or ensure checks were made for certain data. We had other priorities, and the challenge of just creating an aggregator from scratch was pretty huge.

In Linked Data, you are bringing all (or many) of the entities within the data to the fore. In a way, it’s as if they can’t hide anymore; they can’t just sit within the context of the collection description and display themselves to users through a Web interface. They have to work a bit harder than that because essentially they all become lead players. And it feels to me as if this is what really makes the quality of the data so important.

I have recently blogged about the issue we have had with identifiers. This is probably the biggest issue we have to deal with. But others have come up. For example, some of our descriptions have ‘Reference’, as you would expect, but they also have ‘Former Reference’ (both in the same tag of ‘unitid’). The problem with this is that it is not always encoded consistently, so then it becomes hard to say ‘where X is included do Y’.

Another example is where we have two or more creators for a description. Up until now, we have simply had one field for contributors to add ‘name of creator’ (the EAD ‘origination’ tag) but that means that two or more names simply go into the same field are not made distinct in a way that a machine can process. It’s fine for display. A human knows that Dr James Laidlaw Maxwell, Dr James Preston Maxwell means two people. But it is harder for a machine to distinguish names if there isn’t a consistent separator. In Linked Data terms it may mean that you end up with a person effectively identified as ‘drjameslaidlawmaxwell,drjamesprestonmaxwell’. (The comma may seem like a reasonable separator, but often commas exist within names, as they can be inverted, and other entries don’t use a comma).

During our Linked Data work, what we have done when we find a data issue is to make a decision whether the issue will be dealt with through the conversion process or dealt with at source. In general, I think its worth dealing with issues at source, because it tends to mean the quality and consistency of the data (thinking particularly in terms of markup) is improved.

Furthermore, this emphasis on the data has led us to think quite fundamentally about many aspects of our data structure, the ways that we ask people to create descriptions and how we can improve our ‘EAD Editor’ in order to ensure more consistency – not just from a Linked Data perspective. It has contributed to a decision to make this kind of data editing more rigorous and better documented. It has also made us think about how to convey what is good practice more effectively, bearing in mind that many people don’t have much of a sense of what might be needed for Linked Data.

However, the other side of the coin is the realisation that¬† you cannot clean your data perfectly. We have over 25,000 collection descriptions and many 100,000′s of lower level entries. It is likely that we will have to live with a certain level of variation, because some cleaning up would be very hard to do other than manually. Our data will always come from a variety of sources, and it may actually be that our move towards importing data from other systems actually introduces more variation. For example, I recently found that a number of descriptions from one contributor, exported from another system, did not provide the ‘creator’ entry as a structured access point (index term).¬† This is a disadvantage with Linked Data, where you are trying to uniquely identify individuals and match that name to other instances of the same person.

Data cleaning can sometimes feel like a can of worms, and I warn those with similar aggregated data, or data from different sources, that dealing with this can really start to eat away at your time! I would certainly advise starting off by thinking about workflow for data cleaning – the reporting, decision making, documenting, addressing, testing, signing-off – whatever you need to do. In retrospect I would have started a spreadsheet straight off. But, overall I think that it has been good for us to think more carefully about our data standards and how we can improve consistency. I feel that it’s something we should address, whether or not Linked Data is involved, because it increases the potential of the data, e.g. for creating visualisations, and it generally makes it more interoperable.

 

 

 

This entry was posted in barriers, data cleaning, data processing, identifiers, linked data. Bookmark the permalink.

One Response to Do not underestimate cleaning your data!

  1. Pingback: URIs, identity, aliases & “consolidation” | Linking Lives

Comments are closed.