Tales of the Unexpected: or, what can happen when you let a bunch of criminals loose on the Internet

One day towards the end of the last millennium, a pair of historians of early modern London hatched a crazy plan to digitise a massive and obscure (to everyone except a few academic crime and legal historians) primary source, published between the 1670s and 1913, and known variously as the Old Bailey Sessions Papers or Old Bailey Proceedings. Part of the challenge, apart from its sheer volume, was that they wanted to capture two very different kinds of information. The consistent format of the Proceedings and the fact that for much of its existence it had been a quasi-official record of all the trials held at the court made it an ideal candidate for a structured database approach that would enable long-term quantitative analyses. But at the same time the trial reports possessed many rich, engaging witness narratives that could only be truly represented by full text digitisation.

This dual identity was resolved by creating full text transcriptions – rekeyed by humans rather than OCRed – that were tagged with XML for database structure. This was a crucially important decision (‘even if it was through luck rather than expertise’). It did have its downsides. It was expensive and time-consuming to create. It generated some terrible technical headaches, since the native XML search engines available in 2003 turned out not to be up to the task of dealing with such a large and complex database.[1] The initial solution involved using two separate search engines in tandem (Lucene for full text search and MySQL for statistical search), until they were finally fully integrated with the completion of the project in 2008 (and even that had its costs).
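A minimal sketch of why one tagged transcription can serve both purposes, using only Python's standard library. The markup below is invented and heavily simplified for illustration: the element and attribute names (`trial`, `interp`, `name`, `role`) and the sample trial are assumptions, not the project's actual schema or data.

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup: names and content are illustrative
# assumptions, not the real Old Bailey Online schema or a real trial.
SAMPLE = """
<trial id="t-example-1" year="1780">
  <interp type="offence" value="theft"/>
  <interp type="verdict" value="guilty"/>
  <p><name role="defendant">John Smith</name> was indicted for stealing
  a silver watch. <name role="witness">Mary Jones</name> deposed that she
  saw the prisoner take it.</p>
</trial>
"""


def structured_record(trial):
    """Collect the tagged fields that a statistical search could count."""
    record = {"id": trial.get("id"), "year": int(trial.get("year"))}
    for interp in trial.iter("interp"):
        record[interp.get("type")] = interp.get("value")
    record["defendants"] = [
        n.text for n in trial.iter("name") if n.get("role") == "defendant"
    ]
    return record


def full_text(trial):
    """Drop the tags and normalise whitespace, as a full-text index might."""
    return " ".join("".join(trial.itertext()).split())


trial = ET.fromstring(SAMPLE)
record = structured_record(trial)
text = full_text(trial)
print(record)
print(text)
```

The same file yields a database-style row (offence, verdict, year, defendant) and a readable narrative, which is roughly the division of labour the two search engines had to handle between them.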

The full significance of the decision was not immediately apparent. The resource’s multi-purpose nature was certainly readily appreciated by a wide range of users: family historians (especially once the post-1834 Proceedings went online), teachers and students, crime and legal historians, historians of material culture, Londoners who simply found reading the stories of their city’s past addictive, and many more. That’s a story that’s already well known, I think, and I hope will be highlighted again in this weekend’s anniversary blogging. It was already visible at the Tales from the Old Bailey conference in 2004, and can be seen in the growing list of publications citing the OBO. Digitisation gave this primary source a whole new lease of life.[2]

The more unexpected tales of the Old Bailey Online that I want to highlight here came about largely because of that fortuitous decision to produce a full, accurate, marked-up text. What had been created was not simply a digital surrogate of a primary source, which humans could surf and search with their web browsers. It was also data: it could be read, and manipulated, and analysed, by machines. As a result, it had the potential to be re-used in ways that went far beyond its creators’ research agendas and even their ambitious visions for opening up access to ‘history from below’.

The first datamining efforts began some time around 2005. An early project was a collaboration between the OBO project staff and members of the University of Sheffield Computer Science department: Armadillo, a textmining/semantic web tool, using the OBO dataset among other 18th-century London datasets. It wasn’t entirely successful, and it seemed to drive most of the people involved to distraction, but it did experiment with techniques that would become increasingly important in our projects, especially Natural Language Processing for automating semantic markup (an important part of London Lives) and distributed search.

Another thread began with some email conversations between Tim Hitchcock and Bill Turkel in about 2005/6. In the summer of 2006, one of Bill’s graduate students, Rebecca Woods, undertook a small textmining project, scraping and analysing trials from the Proceedings with fairly basic Perl scripts. (The code she wrote is still available but would need changes to the site URLs to work.) A couple of years later, armed with the newly completed full set of XML files for 1674-1913, Bill wrote his Naive Bayesian in The Old Bailey series (and a subsequent presentation at the 2008 project conference). This extensive demonstration of the possibilities of machine learning as a historical research tool paved the way for the international collaborative project Datamining With Criminal Intent in 2009-11.[3]
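For readers who have not met the technique, here is a minimal sketch of a naive Bayes text classifier in plain Python. It is not Bill's code and makes no claim about how his series implemented it: the four training sentences and the two labels are invented toy data, and the function names are hypothetical. It uses word counts per category with add-one (Laplace) smoothing.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case and split on whitespace (deliberately crude)."""
    return text.lower().split()


def train(labelled_docs):
    """Count documents per label and word occurrences per label."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labelled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing so unseen word/label pairs are not zero.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy examples, not text from the Proceedings.
TRAINING = [
    ("stole a silver watch from the shop", "theft"),
    ("took a linen handkerchief from his pocket", "theft"),
    ("struck him on the head with a poker", "violence"),
    ("stabbed the deceased with a knife", "violence"),
]

model = train(TRAINING)
print(classify("stole a handkerchief from the shop", model))
print(classify("struck the deceased with a knife", model))
```

With a training set this small the result means nothing historically; the point is only that the whole method fits in a few dozen lines once the trial texts are available as data.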

I think that 2008-9 was also roughly when we started talking a lot about APIs (even if we didn’t all know exactly what they were) and worrying about the “silo effect” of disconnected digital resources. The main Sheffield-based project to come out of that was Connected Histories (which has also led to Manuscripts Online, a medieval manuscripts project using the same methodology). We weren’t the only people thinking about massive federated search engines though, and the Old Bailey Online data can now also be searched through NINES and 18th Connect.

But perhaps the most unexpected tales of all come from a quite different discipline: historical linguistics. Our list of publications citing the OBO points to some of the research going on, and at least part of that probably uses the work of Magnus Huber. This goes back to 2004, when Magnus was looking online for potential sources, and stumbled on the Old Bailey Online. The process of transforming the XML dataset into a linguistic corpus involved identifying and tagging direct speech in the trial reports, “part-of-speech” (POS) tagging, and finally compiling The Old Bailey Corpus, which includes “407 Proceedings, ca. 318,000 speech events, ca. 14 million spoken words, ca. 750,000 spoken words/decade”.[4]

A cautionary note, perhaps, at this point. Tim Hitchcock worries a bit about the (growing) move towards ‘Big Data’ approaches in Digital Humanities/History:

One problem is that these new methodologies are and will continue to be reasonably technically challenging. If you need to be command-line comfortable to do good history – there is no way the web resources created are going to reach a wider democratic audience, or allow them to create histories that can compete for attention with those created within the academy – you end up giving over the creation of history to a top down, technocratic elite.

So, yes, we should be creating interfaces, like that of Locating London’s Past, or the OBAPI Demonstrator, that enable people without specialist skills to explore the OBO in new ways. But at the same time, I think that opening up the Old Bailey Online data to those who do have more technical skills is crucial for continuing to widen the reach of our project. Users of the website in the past have often written to us, frustrated by the limitations of the search facilities we can provide, and they have been willing to take on that challenge to make it possible to do their own thing. Yes, those people have tended to be from universities (often resourceful and enthusiastic postgraduate students) but there’s no inherent reason for that always to be the case.

As scientists sometimes remind us humanists, this isn’t really Big Data at all. We shouldn’t exaggerate; the OBO dataset doesn’t demand supercomputers, eye-poppingly expensive software, or teams of professional data scientists and programmers, all of which are rather larger barriers to democratic knowledge than learning Python. In any case, the barriers keep shifting and getting smaller: as Mark Liberman has said, “the first bible concordance took thousands of monk-years to compile; today, any bright high school student with a laptop can do better in a few hours”.

Before 2003, as far as Magnus Huber knows, no linguist had ever looked at the Proceedings or thought of them as a potential corpus; the printed volumes were simply not suited to this kind of work (besides which, he notes, the 18th-19th centuries were a relatively neglected period in historical linguistics). He also believes that the Old Bailey Corpus is the first sociohistorical corpus to have been compiled entirely from an electronic version of a historical source, using its markup in a systematic and (semi-)automated way, rather than compiling manually from print editions or manuscripts.[5]

I want us to spend the next 10 years making the OBO data as accessible as possible, in as many ways as possible, to as many people as possible. I want to know what else it has to tell that no one has thought of asking yet.

Selfishly enough, I just want to keep being surprised.

——————————-

[1] This was before my time, I should note: Tim and Bob wrote about some of the early decisions and struggles in ‘Digitising History From Below: The Old Bailey Proceedings Online, 1674-1834’, History Compass 4 (2006). (OA version)

[2] We documented some of it in our 2011 impact analysis.

[3] Another of Bill’s ex-students, Adam Crymble, has yet to make his escape from OBO’s clutches.

[4] See also this article (2007) and the podcast of a seminar paper Magnus gave at the IHR Digital history seminar in February 2012.

[5] This is from email correspondence with Magnus, who very generously answered a barrage of questions out of the blue.

3 thoughts on “Tales of the Unexpected: or, what can happen when you let a bunch of criminals loose on the Internet”

  1. Oh, and I left out of this a number of rather less exciting but crucially important ways in which we’ve frequently re-used the original OB data in other projects. The tagged names and places have been used time and again in our dictionaries for NLP/entity recognition processing – the 1834-1913 Proceedings, London Lives, Connected Histories, Locating London’s Past, Manuscripts Online (and probably quite a few other HRI digital projects too); all of these keep building on each other; none could exist as they are now without that very first iteration of OBO 1674-1834.

  2. I’m the trustee of a nonprofit which is looking at the costs/benefits of encoding a large text archive into XML (possibly TEI) versus continuing with the current relational database approach we have. Would love to discover more technical details of what you learned doing this for the Old Bailey Papers and what you would recommend now in terms of database choices, search implementation, etc. We have the advantage of already having fully digital text, but also the additional challenge of needing to link audio/video assets for some text assets…

  3. I’m not myself fully au fait with all the technical possibilities, I should say, as I’m not a programmer. (I don’t know if there have been any significant advances in XML database technology in recent years.) Perhaps I tend to think of XML itself more as a storage format that you can then convert into other formats eg for text processing or search. I suppose I’m wondering what sort of text archive this is, and if it’s already in a database, what do you hope to gain by the conversion – what is it you want to do with it in XML that you can’t already do? (Is it searchable online already?) Do you want to add semantic markup, for example? Do you want to turn it into Linked Open Data?
