We’re holding an afternoon workshop on record/data linkage in Sheffield on 4 November. The aim is to explore the challenges and rewards of applying automated nominal record linkage to large-scale historical datasets, with all their variability, fuzziness and uncertainties, but we’d also very much welcome participants and insights from all fields concerned with data linkage including social sciences, health sciences and computer science. In addition to presentations about our work in progress on 90,000 19th-century prisoners and convicts, we have guest speakers who will bring extensive experience of historical record linkage projects to the discussion. It’s free to attend and anyone with an interest, at any stage of their academic career, is welcome (I’d particularly love to see plenty of PhD students!). More info can be found on our website here (and there’s also a programme to download).
Record linkage is really at the heart of the Digital Panopticon project’s goals to explore the impact of different types of punishment on Old Bailey Online defendants between about 1780 and 1875 (along with working on data visualisations for exploring, presenting and communicating the data and research findings). Our research questions include: How can we improve current record-linkage processes to maximise both the number of individuals linked across different datasets and the amount of information obtained about each individual? What is the minimum amount of contextual information needed to conduct successful large-scale record linkage of data pertaining to specific individuals?
I’ve blogged in the past about problems associated with historical record linkage where you don’t have handy unique IDs (like, say, National Insurance numbers): names are often crucial but highly problematic, and a source like Old Bailey Online tells us about sentences but not the punishments actually carried out. Those are among our biggest headaches with Digital Panopticon.
There are a lot of missing people when we link OBO to transportation records, and a lot of possible reasons for linkage to fail. Errors can creep into the data at almost any point between the making of the original source and our production of a specific dataset to feed to the computer: eg, if you’re extracting a London-only subset from a national dataset and you’re not careful, you might also end up with records from Londonderry. Oops. (“You” there is a euphemism for “I”.)
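The Londonderry mistake is easy to make with a naive substring filter. A minimal sketch (the records and field names here are invented for illustration, not the project’s actual data):

```python
# Toy records standing in for a national dataset
records = [
    {"id": 1, "place": "London"},
    {"id": 2, "place": "Londonderry"},
    {"id": 3, "place": "York"},
]

# Naive substring filter: accidentally sweeps in Londonderry too
naive_subset = [r for r in records if "London" in r["place"]]

# Safer: compare against the normalised place name exactly
london_subset = [r for r in records if r["place"].strip().lower() == "london"]

print([r["id"] for r in naive_subset])   # includes the Londonderry record
print([r["id"] for r in london_subset])  # London only
```

Real place-name fields are messier than this, of course (abbreviations, misspellings, parishes within London), but exact matching on a normalised value at least avoids the most embarrassing false positives.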
Then there are problems caused by spelling variations in names, or the use of aliases and entirely different names. And there’s the problem of common names. As I blogged before: “How do you decide whether one Robert Scott is the same person as another Robert Scott, or someone else altogether?” But that gets much worse when the name in question is “Mary Smith”.
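One common way to cope with spelling variation is to score candidate name pairs by string similarity rather than demanding exact matches. A minimal sketch using Python’s standard-library `difflib` (this is an illustration of the general technique, not the matching method the project actually uses):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude similarity score between two names, from 0.0 (nothing
    in common) to 1.0 (identical), ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# A spelling variant scores high; an unrelated name scores low
print(round(name_similarity("Robert Scott", "Robt. Scott"), 2))
print(round(name_similarity("Mary Smith", "John Brown"), 2))
```

A score like this helps with “Scott” vs “Scot”, but it does nothing for the common-name problem: two genuinely different Mary Smiths score a perfect 1.0, which is exactly why name similarity has to be combined with other evidence (dates, places, ages) before a link can be trusted.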
And then there are the failures due to gaps in our data: were they pardoned? Did they die in prison or on the hulks before they could be transported? So we are on a quest to track down sources that can tell us these things and fill the gaps (not all of those sources have been digitised, and some have probably not survived at all, especially from the 18th century).
Irreconcilable conflicts can emerge between different sources (eg, different trial dates and places). At this point we have to turn to the specialist knowledge of the project team about how, when and where particular sources were created, so we can attempt to rate the relative reliability of two conflicting sources. But how are we going to handle those weightings when we’re dealing with thousands of people and the links are all probables anyway? (Just because source A is generally more reliable for a certain piece of information than source B doesn’t mean A is always right and B is always wrong when they conflict.)
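At scale, one way such reliability ratings could be applied mechanically is to score each conflicting value by the weight of the sources backing it and keep the score alongside the chosen value, so low-confidence decisions can be flagged for review. A minimal sketch (the source names and weights here are entirely invented for illustration):

```python
# Hypothetical per-source reliability weights (not real project figures)
RELIABILITY = {"trial_registers": 0.9, "ship_indents": 0.6}

def reconcile(values):
    """Given (source, value) pairs for one field of one person, return
    the value backed by the greatest total source weight, plus that
    weight, so a human can review weakly-supported winners later."""
    scores = {}
    for source, value in values:
        # Unknown sources get a neutral default weight
        scores[value] = scores.get(value, 0.0) + RELIABILITY.get(source, 0.5)
    best = max(scores, key=scores.get)
    return best, scores[best]

# Two sources disagree about a trial date; the more reliable one wins
value, weight = reconcile([
    ("trial_registers", "1842-03-01"),
    ("ship_indents", "1842-03-08"),
])
print(value, weight)
```

Crucially, this only encodes the “generally more reliable” judgement; it does not make source A always right, which is why keeping the supporting weight (and ideally the losing values too) matters more than the pick itself.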
So there will be plenty to discuss at the workshop and for the next three years!
For tasters of what we’ve been getting up to so far: