Collaboration and crowdsourcing for Old Bailey Online and London Lives

My digital crime history talk included some mention of ‘crowd sourcing’ and our stuttering efforts in this direction (on various projects) over the last five years or so. This post is intended as a marker to get down some further thoughts on the subject that I’ve been mulling over recently, to start to move towards more concrete proposals for action.

Two years ago, we added to OBO a simple form for registered users to report errors: one click on the trial page itself. People have been using that form not simply to report errors in our transcriptions but to add information and tell us about mistakes in the original. Clearly, our site users want to contribute what they know.

We now have a (small but growing) database of these corrections and additions, which we rather failed to foresee and currently have no way of using. There is some good stuff! Examples:

t18431211-307    The surname of the defendents are incorrect. They should be Jacob Allwood and Joseph Allwood and not ALLGOOD

t18340220-125    The text gives an address Edden St-Regent St. I believe the correct street name is Heddon St, which crosses Regent St. There is no Edden St nearby. There is an Eden St and an Emden St nearby, but neither meet Regent St.

t18730113-138    The surname of the defendant in this case was Zacharoff, not Bacharoff as the original printed Proceedings show. He was the man later internationally notorious as Sir Basil Zaharoff, the great arms dealer and General Representative abroad of the Vickers armaments combine. See DNB for Zaharoff.

t18941022-797    Correct surname of prisoner is Crowder see Morning Post 25.10.1894. Charged with attempted murder  not murder see previous citation.

It also bothers me, I’d add, that there’s no way of providing any feedback (let alone encouragement or thanks). If I disagree with a proposed correction, I don’t have a way to let the person reporting the issue know that I’ve even looked at it, let alone explain my reasoning. Someone suggested, for example, that the murder of a two-year-old child ought to be categorised as ‘infanticide’, but we use that term only for a specific form of newborn infant killing that was prosecuted under a particular statute during the period of the Proceedings.

On top of which, I think it’s going to become an increasing struggle to keep up even with straightforward transcription corrections because the method we’ve always used for doing this now has considerably more friction built in than the method for reporting problems!

So, the first set of problems includes:

  • finding ways to enable site users to post the information they have so that it can be added to the site in a useful way (not forgetting that this would create issues around security, spam, moderation, etc)
  • improving our own workflow for manual corrections to the data
  • solving a long-standing issue of what to do about names that were wrongly spelt by the reporters or have variant spellings and alternatives, which makes it hard for users to search for those people
  • maybe also some way of providing feedback

A possible solution, then, would be a browser-based collaborative interface (for both Old Bailey Online and London Lives), with the facility to view text against image and post contributions.

  • It should be multi-purpose, with differing permissions levels for project staff and registered users.
  • Corrections from users would have to be verified by admin staff, but this would still be much quicker and simpler than the current set-up.
  • But it would be able to do more than just corrections – there would be a way of adding comments/connections/annotations to trials or documents (and to individual people).

A rather different and more programmatic approach to (some of) the errors in the OBO texts than our individualised (ie, random…) and manual procedures was raised recently by Adam Crymble.

For such a large corpus, the OBO is remarkably accurate. The 51 million words in the set of records between 1674 and 1834 were transcribed entirely manually by two independent typists. The transcriptions of each typist was then compared and any discrepancies were corrected by a third person. Since it is unlikely that two independent professional typists would make the same mistakes, this process known as “double rekeying” ensures the accuracy of the finished text.

But typists do make mistakes, as do we all. How often? By my best guess, about once every 4,000 words, or about 15,000-20,000 total transcription errors across 51 million words. How do I know that, and what can we do about it?

… I ran each unique string of characters in the corpus through a series of four English language dictionaries containing roughly 80,000 words, as well as a list of 60,000 surnames known to be present in the London area by the mid-nineteenth century. Any word in neither of these lists has been put into a third list (which I’ve called the “unidentified list”). This unidentified list contains 43,000 unique “words” and I believe is the best place to look for transcription errors.
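The filtering step Adam describes is essentially a set difference. Here is a minimal sketch in Python, assuming the dictionaries and the surname list have already been loaded as sets of lower-case strings. The function name `unidentified_words` and the tiny word lists are invented for illustration; this is not Adam's actual code.

```python
import re

def unidentified_words(text, dictionary, surnames):
    """Return the unique tokens in `text` that are in neither word list."""
    # Crude tokenisation (an assumption): runs of letters or apostrophes,
    # lower-cased so that matching ignores capitalisation.
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    known = dictionary | surnames
    return sorted(tokens - known)

# Toy data, made up for this example.
dictionary = {"the", "prisoner", "was", "seen", "by", "in", "street"}
surnames = {"allwood", "scott"}
sample = "The prifoner was seen by Allwood in Heddon Street"

print(unidentified_words(sample, dictionary, surnames))
# ['heddon', 'prifoner']
```

Note that the toy output contains one genuine transcription problem ('prifoner') and one perfectly good street name ('heddon'), so a list like this is a place to start looking rather than a list of errors.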

Adam notes that this is complicated by the fact that many of the ‘errors’ are not really errors; some are archaisms or foreign words that don’t appear in the dictionaries, and some (again) are typos in the original.

Certain types of error that he identified could potentially be addressed with an automated process, such as the notorious confusion of the long ‘S’ with ‘f’: “By writing a Python program that changed the letter F to an S and vise versa, I was able to check if making such a change created a word that was in fact an English word.”
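To make the idea concrete, here is a rough sketch of what such a letter-swap check might look like. It is my own guess at the approach, not Adam's program: it assumes a dictionary held as a set of lower-case words, and it tries every combination of 'f' and 's' in the problem word rather than a single swap. The function name `long_s_candidates` is hypothetical.

```python
from itertools import product

def long_s_candidates(word, dictionary):
    """Return dictionary words reachable by swapping 'f' and 's' in `word`."""
    word = word.lower()
    # At each position try both 'f' and 's' where the original has either;
    # every other letter is left alone.
    options = [("f", "s") if ch in "fs" else (ch,) for ch in word]
    variants = {"".join(combo) for combo in product(*options)}
    variants.discard(word)  # the unchanged word is not a correction
    return sorted(variants & dictionary)

# Toy dictionary, made up for this example.
dictionary = {"prisoner", "first", "said"}

print(long_s_candidates("prifoner", dictionary))  # ['prisoner']
print(long_s_candidates("firft", dictionary))     # ['first']
```

The function only proposes candidates. Whether a candidate is actually the right reading still has to be decided by someone looking at the page image.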

But any entirely automated procedure would inevitably introduce some new errors, which we’re obviously reluctant to do (being pedantic historians and all that). So what to do?

Well, why not combine the power of the computer and the ‘crowd’? We could take Adam’s ‘unidentified list’ as a starting point, so that we’re narrowing down the scale of the task, and design around it a specific simplified and streamlined corrections process, within the framework of the main user interface. My initial thoughts are:

  • this interface would only show small snippets of trials (enough text around the problem word to give some context) and highlight the problem word itself alongside the page image (unfortunately, one thing we probably couldn’t do is to highlight the word in the page image itself)
  • it would provide simple buttons to confirm either a) the dictionary version or a letter switch, or b) that the transcription is correct, with c) a text input field as a fallback if the correction needed is more complex; hopefully most answers would be a or b!
  • if at least two (maybe three?) people provide the same button answer for a problem word it would be treated as a verified correction (though this could be overruled by a project admin), while text answers would have to go to admins for checking and verification in the same way as additions/corrections submitted in the main interface
  • we should be able to group problems by different types to some extent (eg, so people who wanted to fix long S problems could focus on those)
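The verification rule in the list above could be sketched roughly as follows. This is only an illustration under my own assumptions: answers are stored as (kind, value) pairs, the labels "accept_suggestion" and "transcription_correct" stand in for buttons a and b, and agreement between button answers takes priority over any free-text answers. None of these names exist in our systems yet.

```python
from collections import Counter

def resolve(answers, threshold=2):
    """Decide the status of one problem word from the answers so far.

    `answers` is a list of (kind, value) pairs, where kind is "button"
    (options a or b) or "text" (the free-text fallback, option c).
    """
    button_votes = Counter(value for kind, value in answers if kind == "button")
    if button_votes:
        value, count = button_votes.most_common(1)[0]
        if count >= threshold:
            # Enough people agree: treat as verified (an admin could
            # still overrule this later).
            return ("verified", value)
    if any(kind == "text" for kind, _ in answers):
        # Free-text corrections always need a human check.
        return ("needs_admin", None)
    return ("pending", None)

print(resolve([("button", "accept_suggestion"), ("button", "accept_suggestion")]))
print(resolve([("button", "accept_suggestion"), ("button", "transcription_correct")]))
print(resolve([("text", "Heddon")]))
```

With the default threshold of two, the three calls give a verified correction, a still-pending word, and a word queued for an admin. Raising `threshold` to 3 would be the 'maybe three?' variant.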

Suggestions from people who know more than me about both the computing and the crowdsourcing issues would be very welcome!

London Lives

At last it’s official, and the work I started in Sheffield in 2006 (yikes!) is almost complete:

London Lives 1690-1800 is open for business.

There are more than 200,000 pages of manuscript material from parish, criminal justice and hospital records, transcribed and marked up for searching in the same way as the Old Bailey Proceedings Online. Plus the 18th-century Proceedings and Ordinary’s Accounts and a group of additional datasets.

The emphasis is on searching for people (although there is also a keyword search) and on nominal record linkage, to facilitate writing the biographies of ordinary and extraordinary 18th-century Londoners.

We’ve started some biographies for you. We’ve also written extensive background material and information about the project itself.

Oh, and next week we’re holding a conference at the University of Hertfordshire to mark the launch.

Do go and explore for yourselves.

Names and people in early modern sources

In my working capacity as the Oracle of the OBP Online, I was recently asked a question that went something like this (details changed):

I’m confused by all these results. If Robert Scott was hanged in 1765, who are all these other Robert Scotts? And some of them are after 1765?!

This is at first glance a slightly daft question – well, obviously, they’re all different people but with the same name, aren’t they? (The question also contains a common misconception about the source, which I’ll come back to in a moment.) And yet, at the same time, it’s not really silly at all.

They might not all be different people. In our database of the names in the OBP there are 142 instances of the name ‘Robert Scott’ (including slight spelling variations). (Mind you, this is nothing compared to a name like John Smith, which occurs more than 4000 times.) How do you decide whether one Robert Scott is the same person as another Robert Scott, or someone else altogether?

And this is without even starting on the problem that a significant proportion of those appearing at the Old Bailey were known by more than one name, and some had a string of aliases and nicknames. Oh, and the reporters (or printers…) sometimes got people’s names – even those of defendants – just plain wrong.

In other words, identifying the relationship between names and people in early modern sources is often extremely tricky, and the question ‘who the hell are all these Robert Scotts?’ isn’t so daft. Which is just as well, really, because this is precisely the kind of problem that’ll be keeping me in work for the next couple of years.

This isn’t just of concern to family historians trying to work out whether someone is really their ancestor or not. Most historians have to make these linkages, ask these questions, at some time or another in the course of their research. Most of us do it on a small scale by hand; a more select group do it on the large scale with computers and algorithms. I’ll hopefully post about both of these later. But in both cases, the process relies on weighing up and ranking probabilities.

Sometimes the answer, either way, is so obvious that the question doesn’t even need to be consciously formed. But at the other end of the scale, there are times when it’s impossible ever to know because you simply don’t have enough information, especially if a name is very common and you have very little contextual information besides the name itself. And I’m sure other historians will have encountered those frustrating borderline cases: if those documents are all referring to the same person, you have a great story. But are you certain enough to rest a serious argument on that identification?

It’s true, for example, that death is a clincher: if you know this Robert Scott died in 1765, then he can’t be the same person as that Robert Scott mentioned in records as alive and well in 1775. (At the other end of the life-cycle, birth is equally conclusive, of course.)

But are you sure he died?

The OBP doesn’t in fact tell you that Robert was hanged (this is the misconception I mentioned above); like archival records from early modern criminal courts, it normally records only the sentence that was passed. But many people sentenced to death in the 18th century were reprieved or pardoned. Unless you have corroborating evidence that the execution was carried out (this does occasionally appear in OBP), you need to be cautious.

So a Robert Scott in the database after 1765 could be the same guy after all. Told you it was tricky.
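For the computationally minded, the reasoning above can be written down as a small rule. This is a toy sketch, not how our database works: the `Record` fields, the exact-match comparison of names, and the function `could_be_same` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    name: str
    year: int                          # year the person appears in a record
    sentence: Optional[str] = None     # sentence passed, e.g. "death"
    execution_confirmed: bool = False  # corroborated by another source

def could_be_same(a: Record, b: Record) -> bool:
    """Return False only when the link can be ruled out."""
    if a.name != b.name:  # real matching would allow spelling variants
        return False
    earlier, later = sorted((a, b), key=lambda r: r.year)
    # A death sentence alone rules nothing out; only a confirmed
    # execution excludes appearances in later years.
    if earlier.execution_confirmed and later.year > earlier.year:
        return False
    return True

sentenced = Record("Robert Scott", 1765, sentence="death")
hanged = Record("Robert Scott", 1765, sentence="death", execution_confirmed=True)
later = Record("Robert Scott", 1775)

print(could_be_same(sentenced, later))  # True
print(could_be_same(hanged, later))     # False
```

Note that a True result here means only 'not ruled out'. Deciding whether two records really are the same person still comes down to weighing probabilities.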

(To be continued…)

A few links (because the place just isn’t the same without them):

The linkage of historical records by man and computer (JSTOR subscription required)
A discourse on method, historical knowledge and information technology
Reconstructing historical communities
AHDS guide

(X-posted at The Long Eighteenth.)

Ah yes, the job

Mmm, let me tell you a bit more about what I’m getting up to. I’ve often waxed lyrical about The Old Bailey Proceedings Online (and see the OBP Blog Symposium). So it’s rather delightfully serendipitous that my new job is as project manager for two new, related London history projects, based in the Humanities Research Institute at the University of Sheffield.

The first and relatively simple task is to complete the OBP job by adding the final run of proceedings from 1834-1913 (under the title of Central Criminal Court proceedings), integrating them into the existing site. In total, this will create a fully searchable major digital primary source for London history, and particularly for the history of non-elite Londoners, running right through from the late 17th century into the early 20th century.

The 18th-century project, Plebeian Lives and the Making of Modern London 1690-1800, is much more difficult and complex. Like many other early modern and 19th-century digital primary sources, the OB/CCC proceedings are printed texts – relatively easy to read and transcribe, and to mark up for digitisation. But the majority of the Plebeian Lives sources will be archival manuscript materials. They will cover a wide range, including legal records such as coroners’ inquests; parish records (eg: pauper letters, vestry minute books); the records of Bridewell and Bethlem hospitals; and apprenticeship records. There’ll also be printed texts, such as Ordinary’s Accounts.

Like the Old Bailey/Central Criminal Court databases, they’ll all end up online: thousands of documents, full text, fully searchable, freely available to all internet users without any subscription barriers. What’s more, we hope to construct a search engine that will make it possible to simultaneously search a number of related online primary source resources alongside ours, including the OBP, and others at different sites such as British History Online.

This is the goal, at least. (I am terrified, whenever I stop being insanely excited.) Right now, all I have for this is a humungous (1 terabyte) hard drive filled with the first batch of scanned document images (very large, high quality .tif files, which is why they take up so much drive space).

The practical difficulties are not minor. Every phase of the process is lengthy and much of it (to be honest) fairly tedious, for both projects. All those documents and printed texts must first of all be microfilmed, scanned, and ‘rekeyed’ (transcribed): that part of it is outsourced, although we have to produce various documentation to guide the rekeyers (and generally nag and cajole the contractors to give us what we want when we want it). Some of the documents will be much better preserved and/or easier to decipher than others.

Then we have to mark up the transcripts in XML, another dull and painstaking task, which will be undertaken in two ways over the next 2 years or so. Right now and with my, um, ‘help’, the HRI programmers are writing fearfully complicated programs that will do substantial sections of the CCC transcripts automatically; the rest will be done manually by several part-time, home-based workers (some of them are postgrad students) who will start this autumn.

Once that markup is done, the CCC project will be quite straightforward to finish off, since it will be essentially a matter of adding it to the existing OBP database and giving it a few tweaks. But for our 18th-century plebeians, our job will barely have begun.

First, the HRI people have to create a powerful search engine that anyone can use fairly easily and, of course, we have to create a web site to present it. We hope that many people with 18th-century interests, from genealogists to academics, will find their own ways of using the resource. What we want to do with it is to analyse the data in order to “reconstruct how ‘ordinary’ Londoners interacted with various government and charitable institutions in the course of their daily lives”. On the one hand, we’ll be doing large scale quantitative analysis and record linkage (to find out, for example, patterns of relationships between claiming poor relief and ending up as a victim or perpetrator of crime). The technique of nominal record linkage has tended to be applied to small rural populations. The computer made record linkage practical in the first place; now the internet is making possible the extension of its methods to the teeming metropolis. On the other hand, we want to do qualitative analysis: where we can find rich enough information about individuals, we’ll trace their individual experiences and uses of the institutions available to them.

I (eventually) get the fun job of writing biographies to put on the website. My bosses have to sit down and write the serious monograph.

I think I have one of the coolest jobs in the universe right now.

. . .

[Parts of this post have been revised and x-posted at my other new bloghome, The Long Eighteenth Century.]