New Year, Old Stuff, Revamped: things in progress


1. Meet the new project, which also happens to be just about my oldest project: Gender and Defamation in York 1660-1700

The core of this is research I did way back in 1999 for my MA dissertation. It was the first archival research for which I had the use of a laptop: I spent a couple of months transcribing cause papers in the old Borthwick Institute in the city of York (it nowadays has a much more modern home at the University of York) and creating a “database” of my 100 or so causes using the cutting-edge technology of 5 x 3 index cards.

The standard of the transcriptions was, well, about what you might expect of a student working for the first time in 17th-century legal archives, with a few months of beginners’ Latin and palaeography under her belt, and this put me off doing anything online with them for a long time. But I’ve been thinking about it on and off since the launch of the York Cause Papers Database in 2010 and the subsequent mass digitisation of images. I’ve tinkered with the material from time to time, but not made much progress.

So rather than continue to keep it all under wraps until some mythical time in the future when it would be “ready”, I’ve decided to practise what I’ve been known to preach – put it online as a work in progress, and document revisions as I go along. Let’s see if putting it out there unfinished will help motivate me to get on with it at a slightly less glacial pace…

I’ve been following Michelle Moravec’s great ‘Writing in Public’ projects and her commitment ‘to making visible the processes by which history making takes place’. Well, the creation of historical data and digital resources is a process too, one that’s often obscured by the practice of launching finished projects with a great fanfare after months or years under wraps. Over in my paid job on the Digital Panopticon that’s something we’re aiming to avoid (watch this space…). So here goes!

The first stages of the project have involved making a useful resource: putting the causes into a database, linking through to the YCP database, keyword tagging, cross-referencing, and adding some links to background information. I’ve also put the data for the database and those crappy transcriptions on GitHub.
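To give a flavour of what’s involved, here’s a rough sketch of the shape of a single cause record – every field name and value below is invented for illustration, not the actual schema (that’s in the GitHub repository):

```python
# Purely illustrative: a guessed-at shape for one cause record.
# All field names and values here are hypothetical.
cause = {
    "cause_id": "defam_042",         # hypothetical local identifier
    "ycp_ref": "CP.H.1234",          # hypothetical reference for linking to the YCP database
    "parties": {"plaintiff": "Jane Dawson", "defendant": "Anne Harrison"},
    "year": 1665,
    "keywords": ["whore", "alehouse", "neighbours"],  # keyword tagging
    "cross_refs": ["defam_017"],     # cross-referencing related causes
    "background_links": [],          # links out to background information
    "transcription": "uncorrected",  # status, pending checking against images
}
```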

Next steps:

  • get the uncorrected transcriptions into the database
  • start checking/correcting (for those that have images available)
  • add more background resources (and integrate my existing defamation bibliography)
  • look at converting the thesis itself into a more web-friendly format, or perhaps turning it into shorter essays

Apart from finally sharing data I created such a long time ago, I hope this little project can do a number of useful things: showcase the York cause papers as a source, provide a useful resource for research into early modern defamation, slander, gossip and reputation, and encourage other researchers to do similar things with their old research stuff.

Record Linkage: project workshop and work in progress


We’re holding an afternoon workshop on record/data linkage in Sheffield on 4 November. The aim is to explore the challenges and rewards of applying automated nominal record linkage to large-scale historical datasets, with all their variability, fuzziness and uncertainties, but we’d also very much welcome participants and insights from all fields concerned with data linkage including social sciences, health sciences and computer science. In addition to presentations about our work in progress on 90,000 19th-century prisoners and convicts, we have guest speakers who will bring extensive experience of historical record linkage projects to the discussion. It’s free to attend and anyone with an interest, at any stage of their academic career, is welcome (I’d particularly love to see plenty of PhD students!). More info can be found on our website here (and there’s also a programme to download).

Record linkage is really at the heart of the Digital Panopticon project’s goal of exploring the impact of different types of punishments on Old Bailey Online defendants between about 1780 and 1875 (along with working on data visualisations for exploring, presenting and communicating the data and research findings). Our research questions include: How can we improve current record-linkage processes to maximise both the number of individuals linked across different datasets and the amount of information obtained about each individual? What is the minimum amount of contextual information needed in order to conduct successful large-scale record linkage of data pertaining to specific individuals?

I’ve blogged in the past about problems associated with historical record linkage where you don’t have handy unique IDs (like, say, National Insurance numbers): names are often crucial but highly problematic, and a source like Old Bailey Online tells us about sentences but not actual punishments. Those are among our biggest headaches with Digital Panopticon.

There are a lot of missing people when we link OBO to transportation records, and a lot of possible reasons for a link to fail. There might be errors in the data created at almost any point between the making of the original source and our production of a specific dataset to feed to the computer: eg, if you’re extracting a London-only subset from a national dataset and you’re not careful, you might also end up with records from Londonderry. Oops. (“You” there is a euphemism for “I”.)
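The Londonderry trap is easy to reproduce. A minimal sketch – the field name and place list are invented for illustration, not our actual data:

```python
# Illustrative only: field names and places are made up for the example.
records = [
    {"id": 1, "place": "London"},
    {"id": 2, "place": "Londonderry"},
    {"id": 3, "place": "City of London"},
]

# A naive substring test quietly sweeps up Londonderry too:
naive = [r for r in records if "London" in r["place"]]
assert {r["id"] for r in naive} == {1, 2, 3}  # oops

# Matching against an explicit list of known variants is safer:
LONDON_PLACES = {"London", "City of London", "Middlesex"}
safer = [r for r in records if r["place"] in LONDON_PLACES]
assert {r["id"] for r in safer} == {1, 3}
```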

Then there are problems caused by spelling variations in names, or the use of aliases and different names. And the problem of common names. As I blogged before: “How do you decide whether one Robert Scott is the same person as another Robert Scott, or someone else altogether?” But that gets much worse when the name in question is “Mary Smith”.
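To show why variant spellings and common names pull in opposite directions, here’s a toy sketch using nothing but Python’s standard library (a real linkage pipeline would use phonetic encodings, dates, places and much more besides):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Crude 0.0-1.0 similarity score between two name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Spelling variants score high, which helps catch Scott/Scot...
print(name_similarity("Robert Scott", "Robert Scot"))        # ~0.96
print(name_similarity("Elizabeth Smith", "Elisabeth Smyth")) # ~0.87

# ...but similarity says nothing about identity: two entirely
# different Mary Smiths score a "perfect" 1.0.
print(name_similarity("Mary Smith", "Mary Smith"))           # 1.0
```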

And then there are the failures due to gaps in our data: Were they pardoned? Did they die in prison or on the hulks before they could be transported? So we are on a quest to track down sources that can tell us these things and fill the gaps (not all of those sources have been digitised; some have probably not even survived, especially from the 18th century).

Irreconcilable conflicts can emerge between different sources (eg, different trial dates and places). At this point we have to turn to the specialist knowledge of the project team on how, when and where particular sources were created so we can attempt to rate the relative reliability of two conflicting sources. But how are we going to handle those weightings when we’re dealing with thousands of people and the links are all probables anyway? (Just because source A is generally more reliable for a certain piece of information than source B doesn’t mean A is always right and B is always wrong if they’re in conflict.)
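To pin down what I mean by weighting, here’s a toy sketch – the sources, reliability scores and dates are all invented for illustration, and this is emphatically not our actual method:

```python
# Invented reliability priors for two hypothetical sources.
RELIABILITY = {
    "trial_register": 0.9,  # assumed: usually right about trial details
    "newspaper": 0.6,       # assumed: more error-prone
}

def weigh_conflict(claims):
    """claims: list of (source, value) pairs for one disputed field.
    Returns the best-supported value and a crude confidence score."""
    support = {}
    for source, value in claims:
        support[value] = support.get(value, 0.0) + RELIABILITY[source]
    best = max(support, key=support.get)
    return best, support[best] / sum(support.values())

value, confidence = weigh_conflict([
    ("trial_register", "12 April 1785"),
    ("newspaper", "19 April 1785"),
])
print(value, confidence)  # '12 April 1785' 0.6 - a win, but hardly certainty
```

The point of the toy example: the generally more reliable source “wins”, but with a confidence of only 0.6 – and once that uncertainty is multiplied across thousands of probabilistic links, it can’t simply be ignored.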

So there will be plenty to discuss at the workshop and for the next three years!

For tasters of what we’ve been getting up to so far:

Data And The Digital Panopticon


Originally posted on Criminal Historian:

[Image: The view from my seat at the DP data visualisation workshop]

Yesterday, I went to All Souls College, Oxford, for a data visualisation workshop organised by the Digital Panopticon project.

The project – a collaboration between the Universities of Liverpool, Sheffield, Oxford, Sussex and Tasmania – is studying the lives of over 60,000 people sentenced at the Old Bailey between 1780 and 1875, to look at the impact of different penal punishments on their lives.

It aims to draw together genealogical, biometric and criminal justice datasets held by a variety of different organisations in Britain and Australia to create a searchable website that is aimed at anyone interested in criminal history – from genealogists to students and teachers, to academics.

This is a huge undertaking, and it is no wonder that the project aims to harness digital technologies in making the material accessible to a wide audience. But how could…


New project, new people: the Digital Panopticon


Starting a new project is exciting and intensely busy (which is also my excuse for taking a month to blog about it). And the Digital Panopticon is the biggest one we’ve done yet.

‘The Digital Panopticon: The Global Impact of London Punishments, 1780-1925’ is a four-year international project that will use digital technologies to bring together existing and new genealogical, biometric and criminal justice datasets held by different organisations in the UK and Australia. The aim is to explore the impact of the different types of penal punishments on the lives of 66,000 people sentenced at the Old Bailey between 1780 and 1925, and to create a searchable website.

The Panopticon, for anyone who doesn’t know, was a model prison proposed by the philosopher Jeremy Bentham (1748-1832): “a round-the-clock surveillance machine” in which prisoners could never know when they were being watched. In Bentham’s own words: “a new mode of obtaining power of mind over mind, in a quantity hitherto without example”. Although Bentham’s plan was rejected by the British government at the time, there were later prisons built along those lines (Wikipedia), and the panopticon has become a modern symbol of oppressive state surveillance and social control.

Bentham criticised the penal policy of transportation and argued that confinement under surveillance would prove a more effective system of preventing future offending. One of DP’s basic themes is to test his argument empirically by comparing the re-offending patterns of those transported and those imprisoned at the Old Bailey. But it will go further, comparing the wider social, health and generational impacts of the two penal regimes into the 20th century.

Technically, DP brings together a number of different methods/techniques we’ve worked on in various projects over the years: digitisation, record linkage, data mining and visualisation, impact, connecting and enhancing resources, with the goal of developing “new and transferable methodologies for understanding and exploiting complex bodies of genealogical, biometric, and criminal justice data”.

However, it’s a much more research-intensive project than the ones we’ve done recently, and that’s reflected in the depth and breadth of the seven research themes. These are based on three central research questions/areas:

  • How can new digital methodologies enhance understandings of existing electronic datasets and the construction of knowledge?
  • What were the long and short term impacts of incarceration or convict transportation on the lives of offenders, their families and their offspring?
  • What are the implications of online digital research on ethics, public history, and ‘impact’?

What’s also exciting (and new for us) is that we’ll have PhD students as well as postdoc researchers (adverts coming soon). Lots of PhD students! Two are part of the AHRC funding package – one at Liverpool and one at Sheffield – and the partner universities have put up funding for several more (two each at Liverpool and Sheffield and one at Tasmania, I think).

The first at Sheffield has just been advertised and the deadline is 2 December (to start work in February 2014):

The Social and Spatial Worlds of Old Bailey Convicts 1785-1875

The studentship will investigate the social and geographical origins and destinations of men and women convicted at the Old Bailey between 1785 and 1875, in order to shed light on patterns of mobility and understandings of identity in early industrial Britain. Using evidence of origins from convict registers and social/occupational and place labels in the Proceedings, the project will trace convicts from their places of origin through residence and work in London before their arrests, to places of imprisonment and subsequent life histories. Analysis of the language they used in trial testimonies will provide an indication of how identities were shaped by complex backgrounds.

Spread the word – and watch this space (and the project website) for more announcements soon!

PS: the project is on Twitter: follow at @digipanoptic

A Zotero resource, and bibliographies online – revisited


Earlier this week, I led a one day course on using Zotero at the British Library (part of their Digital Scholarship training programme for staff) – many thanks to James Baker for the invitation.

It was a very hands-on course, starting with the assumption that most people there would never have used Zotero before, and gradually building up in difficulty. We packed a lot into one day and the approach seemed to go down well.

James also generously agreed to me opening up the web resource I put together for the course (in PmWiki) for public consumption. It contains most of the exercises we worked through during the day – they are quite strongly BL-oriented, with plenty of my favourite topics (naturally…), but I think more generally applicable – as well as selected examples of the different kinds of things people and projects have done and are doing with Zotero: teaching, group collaboration, research management, plugin development, publication, integration with other resources, and so on.

And so, here it is, under a Creative Commons licence – use, re-use, mix, borrow and adapt if you’d find it useful!


Additionally, I found lots of interesting things while I was preparing the course, so I put them into a Zotero bibliography – well, what else?! – and made it into a public group, which Zotero users are very welcome to join and add to:

Managing Digital Research

I found myself answering the question “Why Zotero?” with some personal history, quite a bit of which was chronicled here on this blog over the years. It occurred to me that I’ve been trying to manage references since my undergraduate dissertation more than 15 years ago, and I’ve been publishing bibliographies online for more than a decade (in the firm belief that it’s one of the most useful small things scholars can do for each other and for students). I’ve been through:

  • index cards (u/g and MA dissertations)
  • a homebrewed MS Access database (for my PhD secondary sources)
  • Endnote (for a while, but only because I got it cheap from my uni)
  • BibDesk (which I still use to some extent)
  • CiteULike
  • LibraryThing
  • Connotea, Mendeley, and probably other things of that ilk
  • Semantic Mediawiki (interesting but too much hard work)
  • wikindx (still in use, but probably phasing out soon)
  • Aigaion
  • and quite a few other things used so briefly I’ve forgotten them…

When I did my PhD research in the early 2000s, I put sources I wanted to quantify in an Access database; secondary references in another one; transcriptions in Word documents (slightly later, they ended up in a different text database); all separate objects, hard to relate to each other. Even though most of my PhD sources haven’t been digitised (and probably never will be), today with Zotero I would approach much of that task quite differently. OTOH, my interest in references in recent years has more often been to do with how to publish large bibliographies online and keep them up to date. Well, Zotero covers that too.

So, for me Zotero has won the contest, hands down. A few of the tools listed can perhaps do specific things better than Zotero, and most of them are just as free (several are open source), but none of them is as versatile and powerful while being so easy to use and to customise. (Wikindx, for example, is excellent, but you need to be able to install a MySQL database and understand a bit about PHP and web apps.)

Zotero provides much more than just “reference management”. It isn’t just that you can quickly save and archive lots of different kinds of things you find online, but also that you can use it to manage research as a process, with changing needs over time – right through from collecting sources to analysis and writing and publishing.

In 2009, when Zotero was in its infancy – before many of its cloud and collaboration features existed, or when they’d only just begun to develop – and I’d barely used it (just 42 items in my 3000+ library were added before 2010), I blogged about the impossibility of online collaborative bibliographies. Hahaha!

On Wednesday, I created a Zotero group live during the course (that took about a minute), and in the space of half an hour about six people, most of whom had never used Zotero at all before that day, put about 30 items in it and added notes and attachments, ranging from news articles and reviews to YouTube videos. (At the other end of the scale, of course, there are Zotero groups creating major resources for their communities.)

Sometimes it’s great to be proved so completely and utterly wrong.

Even in that 2009 post, I see that I added a comment wondering if Zotero could be the solution to the problem. Maybe, too, the discussion we had about the decision to turn the RHS British and Irish History bibliography into a subscription service could look very different now.

Digital history blogs


I compiled a quick list very recently for someone who was looking for introductions to digital history and people doing digital history work. And having done it, I thought I might as well share it.

Firstly: in one respect, this is a broad tent – some of these people are, strictly speaking, in literature or historical linguistics. But the boundaries are fuzzy, and what they’re doing is relevant to historians’ research too.

Secondly: in another respect, it’s a fairly narrow subset of digital historians who blog – people who are posting about digital tools and techniques they’re using, things they’re building, practical hacks and code, and reflections on the process and the results they’re getting from doing those things.

Thirdly: it was put together very quickly from my RSS feeds and Twitter favourites. Who am I missing? (Feel free to plug your own blog.) What group blogs should be included?

  • Bill Turkel – “computational history, big history, STS, physical computing, desktop fabrication and electronics”
  • Tim Sherratt – “digital historian, web developer and cultural data hacker”; Invisible Australians
  • Adam Crymble – large-scale textual analysis; 18-19th century London
  • John Levin – mapping and visualisation; 18th-century London
  • Jean Bauer – database design and development; late 18th/early 19th-century USA
  • Jason Heppler – hacking/scripting (Ruby evangelist); 20th century USA
  • Caleb McDaniel – hacking/scripting; American abolitionism
  • Chad Black – hacking/scripting; early Latin America
  • Lincoln Mullen – databases, R; religion in 18-19th-century America
  • Fred Gibbs – mapping, metadata, textmining; medieval/early modern medicine and science
  • Jeri Wieringa – textmining; American religious history
  • Ben Brumfield – crowdsourced transcription software (software developer, family historian)
  • Heather Froehlich – corpus linguistics; early modern drama and gender (lit/lang)
  • Ted Underwood – “applying machine learning algorithms to large digital collections” (lit)

Not recently active so I nearly forgot about them…

Because I really do have a terrible memory (sorry…)

More via Twitter (thanks @paige_roberts, @wynkenhimself)

From comments (thanks!)

Labels are a bit random, I know: just for a flavour of what people do. Tidying up might happen later.

Collaboration and crowdsourcing for Old Bailey Online and London Lives


My digital crime history talk included some mention of ‘crowdsourcing’ and our stuttering efforts in this direction (on various projects) over the last five years or so. This post is intended as a marker: to get down some further thoughts on the subject that I’ve been mulling over recently, and to start moving towards more concrete proposals for action.

Two years ago, we added to OBO a simple form for registered users to report errors: one click on the trial page itself. People have been using that form not simply to report errors in our transcriptions but to add information and tell us about mistakes in the original. The desire on the part of our site users to contribute what they know is there.

We now have a (small but growing) database of these corrections and additions, which we rather failed to foresee and currently have no way of using. There is some good stuff! Examples:

t18431211-307    The surname of the defendents are incorrect. They should be Jacob Allwood and Joseph Allwood and not ALLGOOD

t18340220-125    The text gives an address Edden St-Regent St. I believe the correct street name is Heddon St, which crosses Regent St. There is no Edden St nearby. There is an Eden St and an Emden St nearby, but neither meet Regent St.

t18730113-138    The surname of the defendant in this case was Zacharoff, not Bacharoff as the original printed Proceedings show. He was the man later internationally notorious as Sir Basil Zaharoff, the great arms dealer and General Representative abroad of the Vickers armaments combine. See DNB for Zaharoff.

t18941022-797    Correct surname of prisoner is Crowder see Morning Post 25.10.1894. Charged with attempted murder  not murder see previous citation.

It also bothers me, I’d add, that there’s no way of providing any feedback (let alone encouragement or thanks). If I disagree with a proposed correction, I don’t have a way to let the person reporting the issue know that I’ve even looked at it, let alone explain my reasoning (someone suggested, for example, that the murder of a two-year-old child ought to be categorised as ‘infanticide’, but we use that term only for a specific form of newborn infant killing that was prosecuted under a particular statute during the period of the Proceedings).

On top of which, I think it’s going to become an increasing struggle to keep up even with straightforward transcription corrections because the method we’ve always used for doing this now has considerably more friction built in than the method for reporting problems!

So, the first set of problems includes:

  • finding ways to enable site users to post the information they have so that it can be added to the site in a useful way (not forgetting that this would create issues around security, spam, moderation, etc)
  • improving our own workflow for manual corrections to the data
  • solving a long-standing issue of what to do about names that were wrongly spelt by the reporters or have variant spellings and alternatives, which makes it hard for users to search for those people
  • maybe also some way of providing feedback

A possible solution, then, would be a browser-based collaborative interface (for both Old Bailey Online and London Lives), with the facility to view text against image and post contributions. (There’s a rough sketch of the sort of record it might store after the list below.)

  • It should be multi-purpose, with differing permissions levels for project staff and registered users.
  • Corrections from users would have to be verified by admin staff, but this would still be much quicker and simpler than the current set-up.
  • But it would be able to do more than just corrections – there would be a way of adding comments/connections/annotations to trials or documents (and to individual people).
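To make the proposal a bit more concrete, here’s a very rough sketch of the kind of record such an interface might store – all the field names, states and workflow here are hypothetical, a thinking aid rather than a design:

```python
# Hypothetical data model for user contributions - not a spec.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class Contribution:
    doc_ref: str                    # eg an OBO trial ID like "t18431211-307"
    kind: str                       # "correction" | "annotation" | "connection"
    body: str                       # the proposed text, comment or link
    submitted_by: str
    submitted_at: datetime = field(default_factory=datetime.utcnow)
    status: str = "pending"         # "pending" -> "accepted" / "rejected"
    reviewed_by: Optional[str] = None
    feedback: Optional[str] = None  # so we can finally explain our decisions

def review(c: Contribution, admin: str, accept: bool, note: str = "") -> None:
    """Admin verification step: every user contribution passes through here."""
    c.status = "accepted" if accept else "rejected"
    c.reviewed_by = admin
    c.feedback = note or None
```

Even something this simple would give us the three things the current set-up lacks: a queue for admin verification, a slot for feedback to contributors, and a structured home for annotations as well as corrections.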

A rather different and more programmatic approach to (some of) the errors in the OBO texts than our individualised (ie, random…) and manual procedures was raised recently by Adam Crymble.

For such a large corpus, the OBO is remarkably accurate. The 51 million words in the set of records between 1674 and 1834 were transcribed entirely manually by two independent typists. The transcriptions of each typist were then compared and any discrepancies were corrected by a third person. Since it is unlikely that two independent professional typists would make the same mistakes, this process known as “double rekeying” ensures the accuracy of the finished text.

But typists do make mistakes, as do we all. How often? By my best guess, about once every 4,000 words, or about 15,000-20,000 total transcription errors across 51 million words. How do I know that, and what can we do about it?

… I ran each unique string of characters in the corpus through a series of four English language dictionaries containing roughly 80,000 words, as well as a list of 60,000 surnames known to be present in the London area by the mid-nineteenth century. Any word in neither of these lists has been put into a third list (which I’ve called the “unidentified list”). This unidentified list contains 43,000 unique “words” and I believe is the best place to look for transcription errors.

Adam notes that this is complicated by the fact that many of the ‘errors’ are not really errors; some are archaisms or foreign words that don’t appear in the dictionaries, and some (again) are typos in the original.
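The filtering step he describes is simple enough to sketch (with toy stand-in word lists; his actual lists were far larger):

```python
# Sketch of building an "unidentified list": words in the corpus that
# appear in neither the dictionaries nor the surname list.
def unidentified_words(corpus_words, dictionaries, surnames):
    known = set().union(*dictionaries) | set(surnames)
    return {w for w in corpus_words if w.lower() not in known}

# Toy stand-ins - the real lists held roughly 80,000 dictionary words
# and 60,000 London-area surnames.
dictionaries = [{"the", "house", "sensible"}, {"whereof"}]
surnames = {"allwood", "zacharoff"}
corpus = ["the", "Allwood", "houfe", "fenfible"]

print(unidentified_words(corpus, dictionaries, surnames))
# {'houfe', 'fenfible'} - candidates for closer inspection
```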

Certain types of error that he identified could potentially be addressed with an automated process, such as the notorious confusion of the long ‘S’ with ‘f’: “By writing a Python program that changed the letter F to an S and vice versa, I was able to check if making such a change created a word that was in fact an English word.”
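Something along those lines might look like this – a sketch of the kind of check Adam describes, not his actual program, and again with a toy stand-in dictionary:

```python
# Long-s check: if swapping every f for s (or vice versa) turns an
# unidentified word into a dictionary word, flag it as a candidate fix.
def long_s_fix(word, dictionary):
    for src, dst in (("f", "s"), ("s", "f")):
        candidate = word.replace(src, dst)
        if candidate != word and candidate in dictionary:
            return candidate
    return None

DICTIONARY = {"house", "sensible", "same"}  # toy stand-in word list

print(long_s_fix("houfe", DICTIONARY))     # 'house'
print(long_s_fix("fenfible", DICTIONARY))  # 'sensible'
print(long_s_fix("fame", DICTIONARY))      # 'same' - but "fame" is a real word too!
```

That last line is exactly the trouble: “fame” would “correct” to “same”, which is why the output of any such program has to be a list of candidates to review, not corrections to apply.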

But any entirely automated procedure would inevitably introduce some new errors, which we’re obviously reluctant to do (being pedantic historians and all that). So what to do?

Well, why not combine the power of the computer and the ‘crowd’? We could take Adam’s ‘unidentified list’ as a starting point, so that we’re narrowing down the scale of the task, and design around it a specific simplified and streamlined corrections process, within the framework of the main user interface. My initial thoughts are:

  • this interface would only show small snippets of trials (enough text around the problem word to give some context) and highlight the problem word itself alongside the page image (unfortunately, one thing we probably couldn’t do is to highlight the word in the page image itself)
  • it would provide simple buttons to confirm either a) the dictionary version or a letter switch, or b) that the transcription is correct as it stands; with c) a text input field as a fallback if the correction needed is more complex; hopefully most answers would be a or b!
  • if at least two (maybe three?) people provide the same checkbox answer for a problem word, it would be treated as a verified correction (though this could be overruled by a project admin), while text answers would go to admins for checking and verification in the same way as additions/corrections submitted in the main interface – there’s a sketch of this agreement rule just after the list
  • we should be able to group problems by different types to some extent (eg, so people who wanted to fix long S problems could focus on those)
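That agreement rule is trivial to express – a toy sketch, where the threshold and the way answers are represented are both just placeholders for discussion:

```python
from collections import Counter

VOTES_NEEDED = 2  # or maybe 3

def tally(answers):
    """Return the verified answer once enough users agree, else None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count >= VOTES_NEEDED else None

print(tally(["houfe -> house"]))                    # None: needs more votes
print(tally(["houfe -> house", "houfe -> house"]))  # 'houfe -> house': verified
print(tally(["correct as is", "houfe -> house"]))   # None: disagreement, over to admins
```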

Suggestions from people who know more than me about both the computing and the crowdsourcing issues would be very welcome!