My digital crime history talk included some mention of ‘crowd sourcing’ and our stuttering efforts in this direction (on various projects) over the last five years or so. This post is intended as a marker to get down some further thoughts on the subject that I’ve been mulling over recently, to start to move towards more concrete proposals for action.
Two years ago, we added to OBO a simple form for registered users to report errors: one click on the trial page itself. People have been using that form not simply to report errors in our transcriptions but to add information and tell us about mistakes in the original. The desire on the part of our site users to contribute what they know is there.
We now have a (small but growing) database of these corrections and additions, which we rather failed to foresee and currently have no way of using. There is some good stuff! Examples:
t18431211-307 The surname of the defendents are incorrect. They should be Jacob Allwood and Joseph Allwood and not ALLGOOD
t18340220-125 The text gives an address Edden St-Regent St. I believe the correct street name is Heddon St, which crosses Regent St. There is no Edden St nearby. There is an Eden St and an Emden St nearby, but neither meet Regent St.
t18730113-138 The surname of the defendant in this case was Zacharoff, not Bacharoff as the original printed Proceedings show. He was the man later internationally notorious as Sir Basil Zaharoff, the great arms dealer and General Representative abroad of the Vickers armaments combine. See DNB for Zaharoff.
t18941022-797 Correct surname of prisoner is Crowder see Morning Post 25.10.1894. Charged with attempted murder not murder see previous citation.
It also bothers me, I’d add, that there’s no way of providing any feedback (let alone encouragement or thanks). If I disagree with a proposed correction, I don’t have a way to let the person reporting the issue know that I’ve even looked at it, let alone explain my reasoning (someone suggested, for example, that the murder of a two-year-old child ought to be categorised as ‘infanticide’, but we use that term only for a specific form of newborn infant killing that was prosecuted under a particular statute during the period of the Proceedings).
On top of which, I think it will become an increasing struggle to keep up even with straightforward transcription corrections, because the method we’ve always used for making them now has considerably more friction built in than the method for reporting problems!
So, the first set of problems includes:
- finding ways to enable site users to post the information they have so that it can be added to the site in a useful way (not forgetting that this would create issues around security, spam, moderation, etc)
- improving our own workflow for manual corrections to the data
- solving a long-standing issue of what to do about names that were wrongly spelt by the reporters or have variant spellings and alternatives, which makes it hard for users to search for those people
- maybe also some way of providing feedback
A possible solution, then, would be a browser-based collaborative interface (for both Old Bailey Online and London Lives), with the facility to view text against image and post contributions.
- It should be multi-purpose, with differing permissions levels for project staff and registered users.
- Corrections from users would have to be verified by admin staff, but this would still be much quicker and simpler than the current set-up.
- But it would be able to do more than just corrections – there would be a way of adding comments/connections/annotations to trials or documents (and to individual people).
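To make that last point a little more concrete, here is a rough sketch of the kind of data model such an interface might sit on. It is purely a thought experiment: every class, field and value below is invented for illustration, not anything we have actually specified.

```python
# Purely illustrative data model for user contributions; all names and
# fields here are invented, not an actual specification.
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum

class Role(Enum):
    REGISTERED_USER = "user"   # can submit corrections/annotations
    PROJECT_STAFF = "staff"    # can verify, reject or edit them

class Status(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    REJECTED = "rejected"

@dataclass
class Contribution:
    document_id: str              # e.g. an OBO trial reference like "t18431211-307"
    submitted_by: str
    submitter_role: Role
    kind: str                     # "correction", "annotation" or "connection"
    text: str
    status: Status = Status.PENDING
    reviewer_note: str = ""       # somewhere to put feedback/thanks to the contributor
    created: datetime = field(default_factory=datetime.utcnow)
```

The point of the reviewer_note field is simply that a structure like this would give us somewhere to put the feedback (and encouragement, and thanks) that we currently have no way of sending.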
Adam Crymble recently suggested a rather different and more programmatic approach to (some of) the errors in the OBO texts than our individualised (ie, random…) and manual procedures.
For such a large corpus, the OBO is remarkably accurate. The 51 million words in the set of records between 1674 and 1834 were transcribed entirely manually by two independent typists. The transcriptions of each typist were then compared, and any discrepancies were corrected by a third person. Since it is unlikely that two independent professional typists would make the same mistakes, this process, known as “double rekeying”, ensures the accuracy of the finished text.
But typists do make mistakes, as do we all. How often? By my best guess, about once every 4,000 words, or about 15,000-20,000 total transcription errors across 51 million words. How do I know that, and what can we do about it?
… I ran each unique string of characters in the corpus through a series of four English language dictionaries containing roughly 80,000 words, as well as a list of 60,000 surnames known to be present in the London area by the mid-nineteenth century. Any word in neither of these lists has been put into a third list (which I’ve called the “unidentified list”). This unidentified list contains 43,000 unique “words” and I believe is the best place to look for transcription errors.
Adam notes that this is complicated by the fact that many of the ‘errors’ are not really errors; some are archaisms or foreign words that don’t appear in the dictionaries, and some (again) are typos in the original.
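For illustration, a minimal sketch of the kind of dictionary filter Adam describes might look something like this; the word-list file names are placeholders, not the actual resources he used.

```python
# Sketch of the dictionary-filter idea: any token that matches neither the
# dictionary word lists nor the surname list is set aside as "unidentified"
# for human review. File names are placeholders.
import re
from collections import Counter

def load_wordlist(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

known_words = set()
for path in ["dict1.txt", "dict2.txt", "dict3.txt", "dict4.txt", "london_surnames.txt"]:
    known_words |= load_wordlist(path)

def unidentified_words(corpus_text):
    """Return a Counter of tokens that appear in none of the word lists."""
    tokens = re.findall(r"[A-Za-z]+", corpus_text)
    return Counter(t.lower() for t in tokens if t.lower() not in known_words)
```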
Certain types of error that he identified could potentially be addressed with an automated process, such as the notorious confusion of the long ‘S’ with ‘f’: “By writing a Python program that changed the letter F to an S and vice versa, I was able to check if making such a change created a word that was in fact an English word.”
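Something along those lines (a hypothetical sketch rather than Adam’s actual code) might be:

```python
def long_s_candidates(word, dictionary):
    """For an unidentified word, swap 'f' and 's' at each position in turn
    and return any variants that do appear in the dictionary."""
    candidates = set()
    for i, ch in enumerate(word):
        if ch in "fs":
            swapped = word[:i] + ("s" if ch == "f" else "f") + word[i + 1:]
            if swapped in dictionary:
                candidates.add(swapped)
    return candidates

# e.g. long_s_candidates("houfe", known_words) -> {"house"}, given "house" in the list
```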
But any entirely automated procedure would inevitably introduce some new errors, which we’re obviously reluctant to do (being pedantic historians and all that). So what to do?
Well, why not combine the power of the computer and the ‘crowd’? We could take Adam’s ‘unidentified list’ as a starting point, so that we’re narrowing down the scale of the task, and design around it a specific simplified and streamlined corrections process, within the framework of the main user interface. My initial thoughts are:
- this interface would only show small snippets of trials (enough text around the problem word to give some context) and highlight the problem word itself alongside the page image (unfortunately, one thing we probably couldn’t do is to highlight the word in the page image itself)
- it would provide simple buttons to confirm either a) that the dictionary version or a letter switch is the right reading, or b) that the transcription is in fact correct as it stands; with c) a text input field as a fallback if the correction needed is more complex; hopefully most answers would be a or b!
- if at least two (maybe three?) people provide the same checkbox answer for a problem word, it would be treated as a verified correction (though this could be overruled by a project admin), while text answers would have to go to admins for checking and verification in the same way as additions/corrections submitted in the main interface (there is a rough sketch of this verification rule after this list)
- we should be able to group problems by different types to some extent (eg, so people who wanted to fix long S problems could focus on those)
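The verification rule mentioned above could be as simple as something like the following; the threshold, the answer codes and the return values are all invented for the sake of the example.

```python
from collections import Counter

VERIFY_THRESHOLD = 2  # or maybe 3: how many matching answers count as verified

CHECKBOX_ANSWERS = {"dictionary_or_swap", "transcription_ok"}

def review(responses):
    """responses: list of (user_id, answer) pairs for one problem word.
    Matching checkbox answers can verify each other; free-text answers
    always go to the admin queue, and an admin can overrule either outcome."""
    checkbox = Counter(ans for _, ans in responses if ans in CHECKBOX_ANSWERS)
    if checkbox:
        answer, count = checkbox.most_common(1)[0]
        if count >= VERIFY_THRESHOLD:
            return {"status": "verified", "answer": answer}
    if any(ans not in CHECKBOX_ANSWERS for _, ans in responses):
        return {"status": "admin_queue"}
    return {"status": "pending"}
```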
Suggestions from people who know more than me about both the computing and the crowdsourcing issues would be very welcome!