Category Archives: Old Bailey Online

Data And The Digital Panopticon

Originally posted on Criminal Historian:

The view from my seat at the DP data visualisation workshop

Yesterday, I went to All Souls College, Oxford, for a data visualisation workshop organised by the Digital Panopticon project.

The project – a collaboration between the Universities of Liverpool, Sheffield, Oxford, Sussex and Tasmania – is studying the lives of over 60,000 people sentenced at the Old Bailey between 1780 and 1875, to look at the impact of different penal punishments on their lives.

It aims to draw together genealogical, biometric and criminal justice datasets held by a variety of different organisations in Britain and Australia to create a searchable website that is aimed at anyone interested in criminal history – from genealogists to students and teachers, to academics.

This is a huge undertaking, and it is no wonder that the project aims to harness digital technologies in making the material accessible to a wide audience. But how could…


New project, new people: the Digital Panopticon

Starting a new project is exciting and intensely busy (which is also my excuse for taking a month to blog about it). And the Digital Panopticon is the biggest one we’ve done yet.

‘The Digital Panopticon: The Global Impact of London Punishments, 1780-1925’ is a four-year international project that will use digital technologies to bring together existing and new genealogical, biometric and criminal justice datasets held by different organisations in the UK and Australia in order to explore the impact of the different types of penal punishments on the lives of 66,000 people sentenced at The Old Bailey between 1780 and 1925 and create a searchable website.

The Panopticon, for anyone who doesn’t know, was a model prison proposed by the philosopher Jeremy Bentham (1748-1832): “a round-the-clock surveillance machine” in which prisoners could never know when they were being watched. In Bentham’s own words: “a new mode of obtaining power of mind over mind, in a quantity hitherto without example”. Although Bentham’s plan was rejected by the British government at the time, there were later prisons built along those lines (Wikipedia), and the panopticon has become a modern symbol of oppressive state surveillance and social control.

Bentham criticised the penal policy of transportation and argued that confinement under surveillance would prove a more effective system of preventing future offending. One of DP’s basic themes is to test his argument empirically by comparing re-offending patterns of those transported and imprisoned at the Old Bailey. But it will go further, to compare the wider social, health, generational impacts of the two penal regimes into the 20th century.

Technically, DP brings together a number of different methods/techniques we’ve worked on in various projects over the years: digitisation, record linkage, data mining and visualisation, impact, connecting and enhancing resources, with the goal of developing “new and transferable methodologies for understanding and exploiting complex bodies of genealogical, biometric, and criminal justice data”.

However, it’s a much more research-intensive project than the ones we’ve done recently, and that’s reflected in the depth and breadth of the seven research themes. These are based on three central research questions/areas:

  • How can new digital methodologies enhance understandings of existing electronic datasets and the construction of knowledge?
  • What were the long and short term impacts of incarceration or convict transportation on the lives of offenders, and their families, and offspring?
  • What are the implications of online digital research on ethics, public history, and ‘impact’?

What’s also exciting (and new for us) is that we’ll have PhD students as well as postdoc researchers (adverts coming soon). Lots of PhD students! Two are part of the AHRC funding package – one at Liverpool and one at Sheffield – and the partner universities have put up funding for several more (two each at Liverpool and Sheffield and one at Tasmania, I think).

The first at Sheffield has just been advertised and the deadline is 2 December (to start work in February 2014):

The Social and Spatial Worlds of Old Bailey Convicts 1785-1875

The studentship will investigate the social and geographical origins and destinations of men and women convicted at the Old Bailey between 1785 and 1875, in order to shed light on patterns of mobility and understandings of identity in early industrial Britain. Using evidence of origins from convict registers and social/occupational and place labels in the Proceedings, the project will trace convicts from their places of origin through residence and work in London before their arrests, to places of imprisonment and subsequent life histories. Analysis of the language they used in trial testimonies will provide an indication of how identities were shaped by complex backgrounds.

Spread the word – and watch this space (and the project website) for more announcements soon!

PS: the project is on Twitter: follow at @digipanoptic

Collaboration and crowdsourcing for Old Bailey Online and London Lives

My digital crime history talk included some mention of ‘crowd sourcing’ and our stuttering efforts in this direction (on various projects) over the last five years or so. This post is intended as a marker to get down some further thoughts on the subject that I’ve been mulling over recently, to start to move towards more concrete proposals for action.

Two years ago, we added to OBO a simple form for registered users to report errors: one click on the trial page itself. People have been using that form not simply to report errors in our transcriptions but to add information and tell us about mistakes in the original. The desire on the part of our site users to contribute what they know is there.

We now have a (small but growing) database of these corrections and additions, which we rather failed to foresee and currently have no way of using. There is some good stuff! Examples:

t18431211-307    The surname of the defendents are incorrect. They should be Jacob Allwood and Joseph Allwood and not ALLGOOD

t18340220-125    The text gives an address Edden St-Regent St. I believe the correct street name is Heddon St, which crosses Regent St. There is no Edden St nearby. There is an Eden St and an Emden St nearby, but neither meet Regent St.

t18730113-138    The surname of the defendant in this case was Zacharoff, not Bacharoff as the original printed Proceedings show. He was the man later internationally notorious as Sir Basil Zaharoff, the great arms dealer and General Representative abroad of the Vickers armaments combine. See DNB for Zaharoff.

t18941022-797    Correct surname of prisoner is Crowder see Morning Post 25.10.1894. Charged with attempted murder  not murder see previous citation.

It also bothers me, I’d add, that there’s no way of providing any feedback (let alone encouragement or thanks). If I disagree with a proposed correction, I don’t have a way to let the person reporting the issue know that I’ve even looked at it, let alone explain my reasoning  (someone suggested, for example, that the murder of a two-year-old child ought to be categorised as ‘infanticide’, but we use that term only for a specific form of newborn infant killing that was prosecuted under a particular statute during the period of the Proceedings).

On top of which, I think it’s going to become an increasing struggle to keep up even with straightforward transcription corrections because the method we’ve always used for doing this now has considerably more friction built in than the method for reporting problems!

So, the first set of problems includes:

  • finding ways to enable site users to post the information they have so that it can be added to the site in a useful way (not forgetting that this would create issues around security, spam, moderation, etc)
  • improving our own workflow for manual corrections to the data
  • solving a long-standing issue of what to do about names that were wrongly spelt by the reporters or have variant spellings and alternatives, which makes it hard for users to search for those people
  • maybe also some way of providing feedback

A possible solution, then, would be a browser-based collaborative interface (for both Old Bailey Online and London Lives), with the facility to view text against image and post contributions.

  • It should be multi-purpose, with differing permissions levels for project staff and registered users.
  • Corrections from users would have to be verified by admin staff, but this would still be much quicker and simpler than the current set-up.
  • But it would be able to do more than just corrections – there would be a way of adding comments/connections/annotations to trials or documents (and to individual people).

A rather different and more programmatic approach to (some of) the errors in the OBO texts than our individualised (ie, random…) and manual procedures was raised recently by Adam Crymble.

For such a large corpus, the OBO is remarkably accurate. The 51 million words in the set of records between 1674 and 1834 were transcribed entirely manually by two independent typists. The transcriptions of each typist was then compared and any discrepancies were corrected by a third person. Since it is unlikely that two independent professional typists would make the same mistakes, this process known as “double rekeying” ensures the accuracy of the finished text.

But typists do make mistakes, as do we all. How often? By my best guess, about once every 4,000 words, or about 15,000-20,000 total transcription errors across 51 million words. How do I know that, and what can we do about it?

… I ran each unique string of characters in the corpus through a series of four English language dictionaries containing roughly 80,000 words, as well as a list of 60,000 surnames known to be present in the London area by the mid-nineteenth century. Any word in neither of these lists has been put into a third list (which I’ve called the “unidentified list”). This unidentified list contains 43,000 unique “words” and I believe is the best place to look for transcription errors.

Adam notes that this is complicated by the fact that many of the ‘errors’ are not really errors; some are archaisms or foreign words that don’t appear in the dictionaries, and some (again) are typos in the original.
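The dictionary check Adam describes can be sketched in a few lines of Python. This is only a minimal illustration, not his actual program: the function name, the tokenising rule and the tiny word lists are all assumptions made for the example, standing in for the real dictionaries and surname list.

```python
import re

def unidentified_words(text, dictionary, surnames):
    """Return the unique tokens that appear in neither word list.

    Assumption: a "word" is any run of letters, compared case-insensitively.
    """
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    known = {w.lower() for w in dictionary} | {s.lower() for s in surnames}
    return sorted(tokens - known)

# Invented sample text and toy word lists, for illustration only.
sample = "The prifoner Allwood ftole a watch"
dictionary = {"the", "prisoner", "stole", "a", "watch"}
surnames = {"Allwood"}

print(unidentified_words(sample, dictionary, surnames))
# ['ftole', 'prifoner']
```

Anything left over is only a candidate: archaisms, foreign words and typos in the original printed text would all land in the same list.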

Certain types of error that he identified could potentially be addressed with an automated process, such as the notorious confusion of the long ‘S’ with ‘f’: “By writing a Python program that changed the letter F to an S and vise versa, I was able to check if making such a change created a word that was in fact an English word.”
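A rough sketch of that kind of letter-switch check might look like the following. It is not Adam's code: the function name and the toy dictionary are hypothetical, and trying every combination of f/s swaps (rather than a single swap) is a choice made here for the example.

```python
from itertools import product

def long_s_candidates(word, dictionary):
    """Return dictionary words reachable by swapping 'f' and 's' in `word`.

    Assumption: `word` and `dictionary` are already lower-cased.
    Every combination of swaps is tried, so a word containing many
    f/s letters produces many candidates to test.
    """
    positions = [i for i, ch in enumerate(word) if ch in "fs"]
    found = set()
    for choice in product("fs", repeat=len(positions)):
        chars = list(word)
        for pos, ch in zip(positions, choice):
            chars[pos] = ch
        candidate = "".join(chars)
        if candidate != word and candidate in dictionary:
            found.add(candidate)
    return sorted(found)

# Toy dictionary, for illustration only.
words = {"prisoner", "possessed", "stole"}
print(long_s_candidates("prifoner", words))   # ['prisoner']
print(long_s_candidates("poffeffed", words))  # ['possessed']
```

A function like this only proposes corrections; it cannot tell whether the swapped word is what the printed page actually says.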

But any entirely automated procedure would inevitably introduce some new errors, which we’re obviously reluctant to do (being pedantic historians and all that). So what to do?

Well, why not combine the power of the computer and the ‘crowd’? We could take Adam’s ‘unidentified list’ as a starting point, so that we’re narrowing down the scale of the task, and design around it a specific simplified and streamlined corrections process, within the framework of the main user interface. My initial thoughts are:

  • this interface would only show small snippets of trials (enough text around the problem word to give some context) and highlight the problem word itself alongside the page image (unfortunately, one thing we probably couldn’t do is to highlight the word in the page image itself)
  • it would provide simple buttons to choose between a) the dictionary version or a letter switch is right, or b) the transcription is correct as it stands; with c) a text input field as a fallback if the correction needed is more complex; hopefully most answers would be a or b!
  • if at least two (maybe three?) people provide the same checkbox answer for a problem word it would be treated as a verified correction (though this could be overruled by a project admin), while text answers would have to go to admins for checking and verification in the same way as additions/corrections submitted in the main interface.
  • we should be able to group problems by different types to some extent (eg, so people who wanted to fix long S problems could focus on those)
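The agreement rule in the third point could be sketched like this. Everything here is an assumption for illustration: the function name, the way answers are represented, the status labels, and the decision that any free-text answer sends the word to an admin even when the button answers agree.

```python
from collections import Counter

def resolve_problem_word(answers, threshold=2):
    """Decide what to do with one problem word.

    `answers` is a list of (kind, value) pairs, where kind is
    "button" (value "a" or "b") or "text" (value is the typed correction).
    """
    texts = [value for kind, value in answers if kind == "text"]
    if texts:
        # Free-text corrections always go to a project admin.
        return ("needs_admin", texts)
    buttons = Counter(value for kind, value in answers if kind == "button")
    ranked = buttons.most_common(2)
    if ranked and ranked[0][1] >= threshold:
        tied = len(ranked) > 1 and ranked[1][1] == ranked[0][1]
        if not tied:
            return ("verified", ranked[0][0])
    return ("pending", None)

print(resolve_problem_word([("button", "a"), ("button", "a")]))
# ('verified', 'a')
print(resolve_problem_word([("button", "a"), ("button", "b")]))
# ('pending', None)
```

Raising `threshold` to 3 would give the stricter version of the rule, and an admin override would sit on top of whatever this returns.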

Suggestions from people who know more than me about both the computing and the crowdsourcing issues would be very welcome!

Bloody Code: reflecting on ten years of the Old Bailey Online and the digital futures of our criminal past

Talk given at Our Criminal Past: Digitisation, Social Media and Crime History Workshop, London Metropolitan Archives, 17 May 2013

My academic apprenticeship, in Aberystwyth, was spent engrossed in two things: first, early modern Welsh and northern English crime archives, and second, the potential of the Internet for research and teaching and simply opening up early modern history to as many people as possible. That wasn’t a completely respectable interest back in 1999, and I’m still amazed sometimes that I’ve been able to spend the last 7 years indulging shamelessly in that obsession and get paid for it.

But what about the first of my obsessions? A couple of weeks ago, the Financial Times told us that more cranes have been erected in London in the past 3 years than everywhere else in the UK put together. I have a nagging worry that I’ve unwittingly contributed to a similar situation in the digital sphere.

I’ve found at least 300 scholarly publications citing OBO, so it’s certainly made its mark on academic research. Beyond academia, it’s directly generated family histories, novels, radio and TV dramas and documentaries. But what impact has it had on digitising crime history? 10 years on, vast swathes of our criminal records remain untouched by the digital. And while there has been large-scale digitisation of sources that crime historians use, not much of it is freely accessible, and little of it has been done by or for us.

A number of historians over the years have worried that OBO skews attention – and resources – disproportionately towards London and the higher courts, representing a tiny minority of prosecuted crimes and policing. As the digital historian and project manager, I’m thrilled to learn of young researchers who chose history of crime because of OBO. But my other half, the archives researcher, is more ambivalent.

TNA ASSI 45/1/3

Early modern court archives aren’t like our neatly packaged, readable trial reports. They’re unwieldy, often dirty, fragmentary, intimidating in overall scale. Documents vary hugely in size, structure, handwriting, materials used and condition, defying any ‘one size fits all’ approach to digitising. They’re frequently written in heavily abbreviated Latin, or ponderously legalese-d English, or an unholy mix of both.

Who would want to struggle with that if they can use something like OBO instead? Would I, if I were a PhD student now? And how much easier is it to turn to OBO for immediate digital rewards than to start new digitisation projects with such awkward and intractable material?

I was asked to introduce themes and challenges that I think are important for the future of digital crime history. So here’s the first challenge: improving digital access to documents like this, and the hundreds of thousands like them in our archives. A second challenge: as always, how to pay for it and sustain it in the long term. And a third is the digital skills we need: I don’t mean necessarily programming, but understanding something about various kinds of code, how to work with digital data, how to work with people who do programme.

And then there are two themes I want to emphasise, that can help us to face the challenges: the need to re-use, recycle and share digital content; and the importance of collaboration and partnerships.

Re-use

Trial of Bridget Callahan, 1760

I’ve blogged recently about the dual identity of the Proceedings; ideal for quantitative analysis, which needs a structured database; but also containing many rich, engaging witness narratives that demanded full text. The solution found in OBO’s case was to transcribe using a double-rekeying process that’s less accurate than traditional standards for scholarly editions, but far more accurate than OCR, and then mark up transcriptions with XML tags to create structure that can be extracted and turned into a database.

Bridget’s trial in TEI-XML

There are certainly downsides to this: time-consuming, expensive, unwieldy. [Both Tim and I agreed in the subsequent discussion that we wouldn't try to do it quite like that today, though I'm not sure we'd be in complete agreement on exactly what we would do instead...]

But the upsides: accuracy, completeness, versatility.

Having created our digital data, it can be manipulated and re-used in many ways. Convert it into other formats. Index it in different ways for different kinds of search. And even transform it with new markup for different purposes, as in Magnus Huber’s Old Bailey Corpus Online. There have been uses of OBO that no one could have predicted.

Same data, four ways

Bridget: Searching in London Lives, Connected Histories, 18th-Connect (are those other results the same person?); a dot somewhere in this graph from Datamining with Criminal Intent. Same data in four places: making new connections, seeing trials in different ways.

I’d argue there are two lessons from OBO for everyone, whatever kind of project or source they have in mind:

  1. digitise in a way that best captures the information in a source;
  2. and in a way that facilitates future re-use and collaboration

Not the specifics of transcription or markup or any particular search engine. Given that many crime history sources are heavily formulaic, or in Latin before 1733, sometimes verbatim transcription can hide as much as it reveals and make it harder to find the useful stuff. Some – many – of our sources simply don’t have rich stories to tell like OBO.

Creating data that is clean and consistent, well-structured and accurately documented may cost more at the beginning, it may require more of an investment in technical skills and management, but it will make your efforts worth more in the long run.

Partnerships

What kind of collaborations and partnerships do we need? Let’s start with the vital one: the relationship between the historians and the keepers of the archival documents. Well, I admit I’m worried about that relationship. Here’s one reason why.

findmypast/TNA http://www.findmypast.co.uk/search/crime-prison-punishment

Why does this resource trouble me? It’s not just because it’s behind a paywall.

OK, in an ideal world, all these resources would be freely accessible to all. But I know all too well how expensive digitisation is. Someone has to pay; it’s just a question of how. The grim reality is that archives and libraries are under intense financial pressure and it’s only going to get worse: and that one of the few reliable paying audiences outside academia is family historians.

And findmypast have made a great, affordable, resource for family historians. But it’s a terribly limited one for crime historians. It’s a name search; as far as I can tell, no separate keyword search or browse. (If I’m wrong, there’s nothing telling me so.)* The needs and priorities of family historians and academics overlap, but they’re not close enough that creating resources that can serve them both well just happens.

Then you have, say, Eighteenth Century Collections Online or British Newspapers 1600-1900, which are designed more for academic audiences, but virtually inaccessible to individuals outside academic institutions, and those whose institutions can’t afford the price tag. And even then pretty much all you can do with those is keyword search and hope what you want isn’t lost in garbled OCR text.

Both kinds of resource are black boxes that make it impossible for a researcher to evaluate the quality of the data or search results; and hinder any kind of use other than those the platform was specifically built for. And if the data is locked away in a box it can never be corrected or improved or enhanced – even though the technology to enable that is continually developing. So publishers lose out, in the end, too.

Are there alternatives to the black box?

Text Creation Partnership

The Text Creation Partnership is funded by a group of libraries led by Michigan. It’s transcribing content from major commercial page-image digitised collections. The resulting texts are restricted to partnership members and resource subscribers for 4 or 5 years and then released into the public domain.

The images continue to be behind the paywall, and not all texts are transcribed. For ECCO the proportion is small, but EEBO’s goal is much more ambitious: one transcription for each unique text (usually first editions). In January 2015, 25,000 phase 1 EEBO texts will become available to everyone for search and to download for textmining or whatever else we can think of by then. (Phase 2 in ?2019 will be something like another 40,000 texts.)

It surely is not beyond our wit to translate that kind of public-private collaborative model to crime records [suggestion here], and for that matter, other archival records with overlapping academic/family history user groups. But to do so, I think we need to build partnerships between historians, archivists and publishers much more than we’ve been doing. And if what you want is totally free-to-access resources, you still need to work with archivists to find answers to the ‘who pays?’ question. I hope today can be a good starting place.

‘The Crowd’

A few good collaborations

But it’s not enough simply to think about institutional collaboration.

A lot of smart people are thinking very hard about ways to facilitate collaborative user participation in digital resources – transcription, indexing, correction, tagging, annotation, linking, and they’re building tools whose usefulness often isn’t confined to volunteer projects and ‘crowd sourcing’. (The re-use maxim applies here too: don’t build from scratch if other people have already done the hard work building and testing good tools.)**

However, don’t imagine ‘the crowd’ is an easy option. OBO has been trying it out for a while and we’re only sort of getting there.

Part of our problem, on reflection, has been adding these things near the end of a project when we launch a website and then hope for something to magically happen while we go off to the next project.

A second issue has been user interfaces and design. We took a while to learn that we have to make participation easy, really easy, and we have to build the design in from early on. It’s no good building something that needs a separate login from the rest of the site, with a flaky user database that was tacked on as an afterthought. Third, and related: understand the limits of what most users are willing to do.

Two years ago, we added to OBO a simple form for registered users to report errors: one click on the trial page itself. People have been using that form not simply to report errors in our transcriptions but to add information and tell us about mistakes in the original. The desire on the part of our site users to contribute what they know is there. Just don’t think that means there’s a ready-made ‘crowd’ waiting to turn up and help you out without plenty of effort on your part.

Historians and our dirty data

And what about us, the historians who have been or are working in the archives? We’re all digitizers now, and have been for a long time. Well, sort of. My computer has folders of databases, transcriptions and (to use the technical term) “stuff” that is kept from public view because, well, it’s a mess and I never get around to the data cleaning needed and it would be embarrassing to let people see my mistakes. I’m sure I’m not alone. [There were nods and sheepish grins all round at this point. You know who you are.]

Increasingly in future, there are going to be requirements from funders to share research data in institutional repositories and the like. We should not be assuming that means just scientists! We shouldn’t in any case be doing this just because someone demanded it; it should become a habit, the right thing to do to help each other.

But we need to get the right training in digital skills for students, so they know how to make good, shareable data, and how best to re-use data shared by others. (Full disclosure: I’m working on a project to create an online data management course for historians at the moment…)

Digitisation, digital history and re-usability don’t have to be all about big funded projects. It can start with personal decisions and actions: clean up your old data, put it in your institutional repository, share it with a Creative Commons licence, tell your colleagues and students it’s there. Relinquish some control. [more thoughts on this]

If we digitise for re-use, and re-use to digitise, we can share and collaborate, and build partnerships that can make some of the challenges of digitisation less intimidating. Digital history should be an iterative, accumulative, learning process rather than one-off ‘projects’ to be ‘launched’ and then left to gather dust.

----

* The findmypast resource has only gone up recently and the content isn’t complete yet. Keyword search functionality is apparently supposed to be included in the resource and it’s possible that will become available as it rolls out. But it should be noted that even a keyword search is unlikely to fulfil the needs that crime historians often have to crunch numbers in complex ways.

** The tools and projects in this slide really are a tiny sample of what’s happening now. They are:

(If you’re on Twitter and you want to know more about this, there are three people I think you really ought to be following as a starting point: @benwbrum, @mia_out and @ammeveleigh.)

Tales of the Unexpected: or, what can happen when you let a bunch of criminals loose on the Internet

One day towards the end of the last millennium, a pair of historians of early modern London hatched a crazy plan to digitise a massive and obscure (to everyone except a few academic crime and legal historians) primary source, published between the 1670s and 1913, and known variously as the Old Bailey Sessions Papers or Old Bailey Proceedings. Part of the challenge, apart from its sheer volume, was that they wanted to capture two very different kinds of information. The consistent format of the Proceedings and the fact that for much of its existence it had been a quasi-official record of all the trials held at the court made it an ideal candidate for a structured database approach that would enable long-term quantitative analyses. But at the same time the trial reports possessed many rich, engaging witness narratives that could only be truly represented by full text digitisation.

This dual identity was resolved by creating full text transcriptions – rekeyed by humans rather than OCRed – that were tagged with XML for database structure. This was a crucially important decision (‘even if it was through luck rather than expertise’). It did have its downsides. It was expensive and time-consuming to create. It generated some terrible technical headaches, since the native XML search engines available in 2003 turned out not to be up to the task of dealing with such a large and complex database.[1] The initial solution involved using two separate search engines in tandem (Lucene for full text search and MySQL for statistical search), until they were finally fully integrated with the completion of the project in 2008 (and even that had its costs).
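To make the idea of ‘full text tagged for database structure’ concrete, here is a minimal sketch using Python’s standard library. The element and attribute names, the trial id and the sample trial are all invented for illustration and are much simpler than the project’s real markup; the point is only that one tagged text can yield both a database-style row and the readable full text.

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup: NOT the actual OBO schema.
SAMPLE = """
<trial id="t-example-1">
  <p><person role="defendant">Jane Doe</person> was indicted for
  <offence category="theft">stealing one linen handkerchief</offence>.</p>
  <p>Witness: I saw the prisoner take it from the stall.</p>
  <verdict category="guilty">Guilty</verdict>
</trial>
"""

def trial_to_row(xml_text):
    """Pull structured fields and the plain text out of one tagged trial."""
    trial = ET.fromstring(xml_text)
    defendant = trial.find(".//person[@role='defendant']")
    offence = trial.find(".//offence")
    verdict = trial.find(".//verdict")
    return {
        "trial_id": trial.get("id"),
        "defendant": defendant.text,
        "offence_category": offence.get("category"),
        "verdict_category": verdict.get("category"),
        # Strip the tags and collapse whitespace for full-text indexing.
        "full_text": " ".join("".join(trial.itertext()).split()),
    }

row = trial_to_row(SAMPLE)
print(row["offence_category"], row["verdict_category"])  # theft guilty
```

The structured fields are the kind of thing a statistical search could count, while `full_text` is what a full-text search engine would index.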

The full significance of the decision was not even immediately apparent. The multi-purpose nature of the resource as a source certainly was readily appreciated by a wide range of different users: family historians (especially once the post-1834 Proceedings went online), teachers and students, crime and legal historians, historians of material culture, Londoners who simply found reading the stories of their city’s past addictive, and many more. That’s a story that’s already well known, I think, and I hope will be highlighted again in this weekend’s anniversary blogging. It was already visible at the Tales from the Old Bailey conference in 2004, and can be seen in the growing list of publications citing the OBO. Digitisation gave this primary source a whole new lease of life.[2]

The more unexpected tales of the Old Bailey Online that I want to highlight here came about largely because of that fortuitous decision to produce a full, accurate, marked-up text. What had been created was not simply a digital surrogate of a primary source, which humans could surf and search with their web browsers. It was also data: it could be read, and manipulated, and analysed, by machines. As a result, it had the potential to be re-used in ways that went far beyond its creators’ research agendas and even their ambitious visions for opening up access to ‘history from below’.

The first datamining efforts began some time around 2005. An early project was a collaboration between the OBO project staff and members of the University of Sheffield Computer Science department: Armadillo, a textmining/semantic web tool, using the OBO dataset among other 18th-century London datasets. It wasn’t entirely successful, and it seemed to drive most of the people involved to distraction, but it did experiment with techniques that would become increasingly important in our projects, especially Natural Language Processing for automating semantic markup (an important part of London Lives) and distributed search.

Another thread began with some email conversations between Tim Hitchcock and Bill Turkel in about 2005/6. In the summer of 2006, one of Bill’s graduate students, Rebecca Woods, undertook a small textmining project, scraping and analysing trials from the Proceedings with fairly basic Perl scripts. (The code she wrote is still available but would need changes to the site URLs to work.) A couple of years later, armed with the newly completed full set of XML files for 1674-1913, Bill wrote his Naive Bayesian in The Old Bailey series (and a subsequent presentation at the 2008 project conference). This extensive demonstration of the possibilities of machine learning as a historical research tool paved the way for the international collaborative project Datamining With Criminal Intent in 2009-11.[3]

I think that 2008-9 was also roughly when we started talking a lot about APIs (even if we didn’t all know exactly what they were) and worrying about the “silo effect” of disconnected digital resources. The main Sheffield-based project to come out of that was Connected Histories (which has also led to Manuscripts Online, a medieval manuscripts project using the same methodology). We weren’t the only people thinking about massive federated search engines though, and the Old Bailey Online data can now also be searched through NINES and 18th Connect.

But perhaps the most unexpected tales of all come from a quite different discipline: historical linguistics. Our list of publications citing the OBO points to some of the research going on, and at least part of that probably uses the work of Magnus Huber. This goes back to 2004, when Magnus was looking online for potential sources, and stumbled on the Old Bailey Online. The process of transforming the XML dataset into a linguistic corpus involved identifying and tagging direct speech in the trial reports, “part-of-speech” (POS) tagging, and finally compiling The Old Bailey Corpus, which includes “407 Proceedings, ca. 318,000 speech events, ca. 14 million spoken words, ca. 750,000 spoken words/decade”.[4]

A cautionary note, perhaps, at this point. Tim Hitchcock worries a bit about the (growing) move towards ‘Big Data’ approaches in Digital Humanities/History:

One problem is that these new methodologies are and will continue to be reasonably technically challenging. If you need to be command-line comfortable to do good history – there is no way the web resources created are going to reach a wider democratic audience, or allow them to create histories that can compete for attention with those created within the academy – you end up giving over the creation of history to a top down, technocratic elite.

So, yes, we should be creating interfaces, like that of Locating London’s Past, or the OBAPI Demonstrator, that enable people without specialist skills to explore the OBO in new ways. But at the same time, I think that opening up the Old Bailey Online data to those who do have more technical skills is crucial for continuing to widen the reach of our project. Users of the website in the past have often written to us, frustrated by the limitations of the search facilities we can provide, and they have been willing to take on that challenge to make it possible to do their own thing. Yes, those people have tended to be from universities (often resourceful and enthusiastic postgraduate students) but there’s no inherent reason for that always to be the case.

As scientists sometimes remind us humanists, this isn’t really Big Data at all. We shouldn’t exaggerate; the OBO dataset doesn’t demand supercomputers, eye-poppingly expensive software, or teams of professional data scientists and programmers, all of which are rather larger barriers to democratic knowledge than learning Python. In any case, the barriers keep shifting and getting smaller: as Mark Liberman has said, “the first bible concordance took thousands of monk-years to compile; today, any bright high school student with a laptop can do better in a few hours”.
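To make Liberman’s point a little more concrete, here is a minimal sketch of a keyword-in-context concordance in Python, using only the standard library. The function name `concordance`, the `width` parameter and the two sample sentences are all invented for illustration; the sample is not a real trial text, and this is not how any of the projects mentioned here actually process the Proceedings.

```python
import re


def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`.

    Matching is case-insensitive and restricted to whole words.
    `width` is the number of characters of context kept on each side.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        start, end = match.span()
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text, standing in for a trial report.
sample = (
    "The prisoner said he found the watch in the street. "
    "The watch was produced in court and sworn to by the prosecutor."
)

for line in concordance(sample, "watch"):
    print(line)
```

Run over a folder of plain-text trials rather than a two-sentence sample, the same few lines would give a rough working concordance, which is more or less the scale of skill I have in mind.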

Before 2003, as far as Magnus Huber knows, no linguist had ever looked at the Proceedings or thought of them as a potential corpus; the printed volumes were simply not suited to this kind of work (besides which, he notes, the 18th-19th centuries were a relatively neglected period in historical linguistics). He also believes that the Old Bailey Corpus is the first sociohistorical corpus to have been compiled entirely from an electronic version of a historical source, using its markup in a systematic and (semi-)automated way, rather than compiling manually from print editions or manuscripts.[5]

I want us to spend the next 10 years making the OBO data as accessible as possible, in as many ways as possible, to as many people as possible. I want to know what else it has to tell that no one has thought of asking yet.

Selfishly enough, I just want to keep being surprised.

——————————-

[1] This was before my time, I should note: Tim and Bob wrote about some of the early decisions and struggles in ‘Digitising History From Below: The Old Bailey Proceedings Online, 1674-1834’, History Compass 4 (2006). (OA version)

[2] We documented some of it in our 2011 impact analysis.

[3] Another of Bill’s ex-students, Adam Crymble, has yet to make his escape from OBO’s clutches.

[4] See also this article (2007) and the podcast of a seminar paper Magnus gave at the IHR Digital history seminar in February 2012.

[5] This is from email correspondence with Magnus, who very generously answered a barrage of questions out of the blue.

Old Bailey Online Update

I posted a few months ago about the Crime in the Community project for Old Bailey Online, and we brought the work to completion last week. This has been a relatively small but really satisfying project (I would have written about it earlier, but was kept busy by the Connected Histories launch on Thursday).

The project started last October with funding from the JISC Impact and Embedding of Digitised Resources Programme. We carried out a rapid(ish) user impact analysis, which was an entirely new but rewarding experience (our report can be downloaded from here). With the aid of the rather awesome Toolkit for the Impact of Digitised Scholarly Resources, this included analysis of site traffic and incoming links, bibliometrics, an online questionnaire, interviews and focus groups.

Previously we had only a rough sense of the ways the site was being used, even in terms of visitor traffic. We learned a lot in the process and it helped us to decide on the new features and functionality to add to the site. So, here are the important bits:

User registration/workspace

  • User workspaces: bookmark trials etc, save search queries, organise them in folders
  • Facility to export the information saved in the workspace
  • User registration for London Lives and Old Bailey Online is integrated so that only one account is needed for both sites
  • Registered users will also be able to report errors through a corrections facility integrated into the site

Extracting information from the site into other formats

  • Citation generator for trials etc and background pages
  • ‘Print page’ function (with citation) for trials etc to enable you to print a page or simply to copy and paste the text without the usual formatting and images
  • Facility to export raw data from statistics search results

Search improvements

  • Facility to refine searches
  • Keyword searches have a new set of options (and / or / phrase / ‘advanced’) to facilitate sophisticated searches while (hopefully) keeping it simple for most basic needs

Tutorials and Guides

Bibliographies

There has been one significant casualty of the project: after reviewing the user stats and responses in the survey and interviews, we decided to pull the Old Bailey Wiki. This meant that we would need to find another way to maintain the site bibliography. I used Zotero intensively while I was compiling a list of publications citing Old Bailey Online (mostly from Google Scholar and Google Books) for the impact analysis, and Zotero’s export facilities and collaborative tools seemed an almost obvious solution to the problem.

So, the ‘official’ Bibliography is now maintained in Zotero and you can view it here at the site. And, if you have a Zotero account, you can contribute items for updates to the Public Group Library. We would welcome your help to keep it updated!

And: good things still to come…

Surveyed users also asked for a number of features that weren’t feasible within the constraints of a short project. But quite a few of these requests should be satisfied by separate projects within the next few months:

First, an Old Bailey API is in development (as part of the Data Mining with Criminal Intent project) and will be launched in the next few months. This will facilitate more sophisticated searching options and facilities for extracting and downloading data for external analysis.

Second, we’ve started work on a new project, Locating London’s Past, which will improve the mapping features of the site (among quite a few other things!).

Crime in the Community

At the Old Bailey Online we have a pot of money to carry out an analysis of site users and add some enhancements to the site particularly aimed at academic researchers and teachers. The project runs until next March (more info here).

As part of this, we’ve posted a short online survey. If you’ve used Old Bailey Online for research or teaching, even if not very often, please fill in the survey – it shouldn’t take more than about 10 minutes and it will help us to decide how best to make use of the funding (and to set future priorities).

Old Bailey and Zotero

This should be of interest to many users of the Old Bailey Proceedings, especially teachers and researchers: you can now use the ‘one-click’ function in Zotero to bookmark documents on the site – when browsing, you’ll see the Zotero icons appear in the browser address bar.

It definitely works for single trials, full sessions and Ordinary’s Accounts, saving the key metadata for the page; it may also work in non-trial sections of sessions (adverts, supplementary material, etc), but I haven’t checked this. It also works from search results (but not stats and map searches); it’ll bring up a list of all the results on the page with checkboxes to save as many or few as you want.

(I believe that Adam Crymble should get the credit for writing the translator, but will happily correct this if I’m wrong!)