Some readers may have noticed that my Early Modern Resources website has been down for a couple of months now. I’m rebuilding it, but it’s going to be a little while. In the meantime, here is a Google spreadsheet of about 150 online primary source collections from the EMR database.
Among its many other wonders, the Internet Archive hosts a marvellous run of 16th- and 17th-century CSPD (Calendars of State Papers Domestic). But they’re not consistently titled, and there are duplicates of many volumes, so it’s not easy to piece them together. I made a chronological list while I was preparing a sample of State Papers petitions for the Power of Petitioning project, so it may be helpful to share it. (For R users, I found the Internet Archive package and this rOpenSci tutorial very helpful.)
TNA guides, including how to convert references in the calendars to modern references:
I think there’s a complete run of CSPD from 1547 to 1660, after which I’ve found only a handful of volumes. (There are three volumes of calendars for the interregnum Committee for the Advance of Money, but I don’t know whether any other Committees were calendared separately from the main run of commonwealth CSPD; if so, they’re not included.) There may be more volumes I didn’t find, and if I learn of any more I’ll update the spreadsheet.
The url is for the volume’s main page on the Internet Archive, from where you can access a PDF, OCR’ed text version and other formats.
Several calendars have multiple copies with separate pages; where this is the case they’re listed in the additional_ids column. If you want to try one of these instead of the main listing (the choice was arbitrary; some copies might be better quality than others), just copy and paste the id of your choice into the search box.
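If you would rather skip the search box, item pages on the Internet Archive live at archive.org/details/ followed by the id, so the links can be built directly. Here is a minimal Python sketch of that. The function name and the sample ids are made up for illustration, and it assumes that multiple ids in the additional_ids column are separated by commas, semicolons or spaces, which may not match the spreadsheet exactly.

```python
from urllib.parse import quote

IA_DETAILS = "https://archive.org/details/"


def volume_urls(main_id, additional_ids=""):
    """Return Internet Archive page URLs for the main copy and any extra copies."""
    # Assumption: extra ids are separated by commas, semicolons or whitespace.
    extras = additional_ids.replace(";", ",").replace(",", " ").split()
    ids = [main_id.strip()] + extras
    # quote() guards against stray characters that are not URL-safe.
    return [IA_DETAILS + quote(item_id) for item_id in ids]


# Invented identifiers, purely for illustration.
for url in volume_urls("example-cspd-vol1", "example-cspd-vol1-copy2; example-cspd-vol1-copy3"):
    print(url)
```

The first URL returned is always the main listing; the rest follow the order of the additional_ids cell.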
And that, in turn, got me thinking about how much freely available source material (primary and secondary) I’ve randomly stumbled across on local historical societies’ websites in the last few years. And wondering: how much more is out there?
I got several great responses to this*, so I began looking more closely, and the TL;DR answer is: a helluva lot of it. The upshot was a Google spreadsheet, which you can see at the bottom of this post.
I’m genuinely impressed at how much stuff these societies have put online, and several more are clearly keen to follow suit – if they can fund it. Some have adopted a pragmatic policy of embargoing their most recent publications (anything between 3 and 10 years; and if you really can’t wait that long, you can buy print or digital copies – or a subscription – for sod all). They often have limited resources, and good quality digitisation isn’t cheap. So, y’know, do encourage societies of interest to you to do this; but don’t lecture them if they haven’t (you might consider instead how you could actually help them to do it).
It’s also more generally noteworthy how many societies have websites (some kept more up to date than others…), and even if they haven’t digitised the publications, nearly all have made finding aids of some sort (indexes, TOCs, abstracts, etc, even searchable databases) available.
(And undoubtedly all this applies far beyond England and Wales, but someone else will have to compile those resources. Sorry.)
Why am I telling you all this? Because these local societies (under their many and varied names: “record/historical/antiquarian/archaeological” society, or some entirely quirky local name) are treasure troves for historians, not just those who think of themselves as “local” historians. They’ve been around for a long time (many were established in the 19th century), publishing high-quality source editions, calendars, abstracts, extracts, indexes, etc, for a wide range of archive sources – parish, legal and administrative, personal, estate records, and more – as well as secondary articles. But often they were published in tiny print runs and even finding aids were hard to come by before the advent of the online catalogue. So it’s a wondrous thing that so many can now be accessed freely and located much more easily.
In addition to content found at society websites, I added a couple more tabs to the spreadsheet: some of the many publications digitised for Welsh Journals Online, and an undoubtedly tiny portion of what might be found at the Internet Archive. Enjoy exploring!
1. You’re almost certain to find it worth the effort
Often, in the endless “should academics learn to code” debate, it’s not clear to newcomers what you can actually use this code for once you’ve invested a lot of time in learning it. Copy&paste online tutorials don’t tend to make things much clearer. How do you get from “hello world!” to practical applications for your research?
But R is all about analysing and presenting data and there aren’t too many historians who don’t work with data of some kind sooner or later. If you already use spreadsheets or SPSS or databases of some kind, and if you ever present tables or graphs in papers, you’ll almost certainly get something out of learning even a small amount of R (and there’ll probably be R packages to make it easy to use it with your current tools). R is flexible: it can be used with conventional tabular statistical data or with linguistic corpora and other textual datasets. You can use it for heavyweight number crunching, textmining, exploratory visualisations at the start of a project, and spectacular ones in presentations and publications – all sorts of humanities data uses. (I reallyreallyreally want to find an application for the beautiful viz in this blogpost.)
2. You don’t have to do it on the command line
I know some people love command line tools. But a good graphical user interface can make all the difference for newcomers and those of us who actually don’t look forward to firing up Terminal. After installing R itself, RStudio is the next thing to download (it’s free). It’s a proper workhorse, including a code editor, console, R packages manager, visualization tools, previewer and more. If you already use Markdown (and maybe even if you don’t yet), you’ll love RMarkdown and R Notebooks.
(On a different GUI-related topic, see also: Github Desktop. You’re welcome.)
3. The Tidyverse
One of my periodic rants is that historians need to understand data and data modelling (even if they don’t think they work with “data”) before they worry about programming code. With R you can learn about both at the same time. The Tidyverse is described as “an opinionated collection of R packages designed for data science”, which “share an underlying philosophy”; tools for creating and working with Tidy Data.
4. Great online learning resources by and for historians
The Programming Historian has several R tutorials from the very basic to more advanced techniques. Currently I think there are four:
R Basics with Tabular Data
Data Wrangling and Management in R
Basic Text Processing in R
Correspondence Analysis for Historical Research with R
Looking beyond these short tutorials, Lincoln Mullen has developed a free online textbook, Computational Historical Methods, “how to identify sources and frame historical questions then answer them through computational methods”, using R.
I’ve written my last two conference papers entirely in R. This means that everything is in plain text and I can easily post online all the data, code and visualisations I used. I put them on Github, but there are other options, like RPubs (from the people who make RStudio, and it’s really easy to send stuff from RStudio straight to RPubs).
My academic apprenticeship, in Aberystwyth, was spent engrossed in two things: first, early modern Welsh and northern English crime archives, and second, the potential of the Internet for research and teaching and simply opening up early modern history to as many people as possible. That wasn’t a completely respectable interest back in 1999, and I’m still amazed sometimes that I’ve been able to spend the last 7 years indulging shamelessly in that obsession and get paid for it.
But what about the first of my obsessions? A couple of weeks ago, the Financial Times told us that more cranes have been erected in London in the past 3 years than everywhere else in the UK put together. I have a nagging worry that I’ve unwittingly contributed to a similar situation in the digital sphere.
I’ve found at least 300 scholarly publications citing OBO, so it’s certainly made its mark on academic research. Beyond academia, it’s directly generated family histories, novels, radio and TV dramas and documentaries. But what impact has it had on digitising crime history? 10 years on, vast swathes of our criminal records remain untouched by the digital. And while there has been large-scale digitisation of sources that crime historians use, not much of it is freely accessible, and little of it has been done by or for us.
A number of historians over the years have worried that OBO skews attention – and resources – disproportionately towards London and the higher courts, representing a tiny minority of prosecuted crimes and policing. As the digital historian and project manager, I’m thrilled to learn of young researchers who chose history of crime because of OBO. But my other half, the archives researcher, is more ambivalent.
Early modern court archives aren’t like our neatly packaged, readable trial reports. They’re unwieldy, often dirty, fragmentary, intimidating in overall scale. Documents vary hugely in size, structure, handwriting, materials used and condition, defying any ‘one size fits all’ approach to digitising. They’re frequently written in heavily abbreviated Latin, or ponderously legalese-d English, or an unholy mix of both.
Who would want to struggle with that if they can use something like OBO instead? Would I, if I were a PhD student now? And how much easier is it to turn to OBO for immediate digital rewards than to start new digitisation projects with such awkward and intractable material?
I was asked to introduce themes and challenges that I think are important for the future of digital crime history. So here’s the first challenge: improving digital access to documents like this, and the hundreds of thousands like them in our archives. A second challenge: as always, how to pay for it and sustain it in the long term. And a third is the digital skills we need: I don’t mean necessarily programming, but understanding something about various kinds of code, how to work with digital data, how to work with people who do programme.
And then there are two themes I want to emphasise, that can help us to face the challenges: the need to re-use, recycle and share digital content; and the importance of collaboration and partnerships.
I’ve blogged recently about the dual identity of the Proceedings: ideal for quantitative analysis, which needs a structured database, but also containing many rich, engaging witness narratives that demanded full text. The solution found in OBO’s case was to transcribe using a double-rekeying process that’s less accurate than traditional standards for scholarly editions, but far more accurate than OCR, and then mark up transcriptions with XML tags to create structure that can be extracted and turned into a database.
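To make the "tags in, database out" idea a bit more concrete, here is a minimal Python sketch using only the standard library. The XML fragment, its tag names and its attributes are all invented for illustration; this is not the actual OBO markup or processing pipeline, just the general shape of the technique: the tags give you the fields for a table, while the full narrative text is still there alongside them.

```python
import xml.etree.ElementTree as ET

# An invented fragment: tag and attribute names are illustrative only,
# not the real OBO markup scheme.
SAMPLE = """
<trials>
  <trial id="t1" year="1750">
    <defendant gender="female">Mary Jones</defendant> was indicted for
    <offence category="theft">stealing a silver spoon</offence>.
    <verdict category="guilty">Guilty</verdict>
  </trial>
  <trial id="t2" year="1751">
    <defendant gender="male">John Smith</defendant> was indicted for
    <offence category="deception">forging a bill</offence>.
    <verdict category="notGuilty">Acquitted</verdict>
  </trial>
</trials>
"""


def trials_to_rows(xml_text):
    """Turn marked-up trial text into a list of flat, table-like records."""
    root = ET.fromstring(xml_text)
    rows = []
    for trial in root.iter("trial"):
        defendant = trial.find("defendant")
        offence = trial.find("offence")
        verdict = trial.find("verdict")
        rows.append({
            "id": trial.get("id"),
            "year": int(trial.get("year")),
            "defendant": defendant.text.strip(),
            "gender": defendant.get("gender"),
            "offence": offence.get("category"),
            "verdict": verdict.get("category"),
            # The narrative survives next to the structured fields:
            # join all text inside the trial and collapse the whitespace.
            "text": " ".join("".join(trial.itertext()).split()),
        })
    return rows


for row in trials_to_rows(SAMPLE):
    print(row["year"], row["defendant"], row["offence"], row["verdict"])
```

Once the rows exist they can be written to CSV or loaded into whatever database or statistics package you prefer, without touching the transcription itself.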
There are certainly downsides to this: time-consuming, expensive, unwieldy. [Both Tim and I agreed in the subsequent discussion that we wouldn’t try to do it quite like that today, though I’m not sure we’d be in complete agreement on exactly what we would do instead…]
But the upsides: accuracy, completeness, versatility.
Once we have created our digital data, it can be manipulated and re-used in many ways. Convert it into other formats. Index it in different ways for different kinds of search. And even transform it with new markup for different purposes, as in Magnus Huber’s Old Bailey Corpus Online. There have been uses of OBO that no one could have predicted.
Bridget: Searching in London Lives, Connected Histories, 18th-Connect (are those other results the same person?); a dot somewhere in this graph from Datamining with Criminal Intent. Same data in four places: making new connections, seeing trials in different ways.
I’d argue there are two lessons from OBO for everyone, whatever kind of project or source they have in mind:
digitise in a way that best captures the information in a source;
& which facilitates future re-use and collaboration
Not the specifics of transcription or markup or any particular search engine. Given that many crime history sources are heavily formulaic, or in Latin before 1733, sometimes verbatim transcription can hide as much as it reveals, make it harder to find the useful stuff. Some – many – of our sources simply don’t have rich stories to tell like OBO.
Creating data that is clean and consistent, well-structured and accurately documented may cost more at the beginning, it may require more of an investment in technical skills and management, but it will make your efforts worth more in the long run.
What kind of collaborations and partnerships do we need? Let’s start with the vital one: the relationship between the historians and the keepers of the archival documents. Well, I admit I’m worried about that relationship. Here’s one reason why.
Why does this resource trouble me? It’s not just because it’s behind a paywall.
OK, in an ideal world, all these resources would be freely accessible to all. But I know all too well how expensive digitisation is. Someone has to pay; it’s just a question of how. The grim reality is that archives and libraries are under intense financial pressure and it’s only going to get worse: and that one of the few reliable paying audiences outside academia is family historians.
And findmypast have made a great, affordable, resource for family historians. But it’s a terribly limited one for crime historians. It’s a name search; as far as I can tell, no separate keyword search or browse. (If I’m wrong, there’s nothing telling me so.)* The needs and priorities of family historians and academics overlap, but they’re not close enough that creating resources that can serve them both well just happens.
Then you have, say, Eighteenth Century Collections Online or British Newspapers 1600-1900, which are designed more for academic audiences, but virtually inaccessible to individuals outside academic institutions, and those whose institutions can’t afford the price tag. And even then pretty much all you can do with those is keyword search and hope what you want isn’t lost in garbled OCR text.
Both kinds of resource are black boxes that make it impossible for a researcher to evaluate the quality of the data or search results; and hinder any kind of use other than those the platform was specifically built for. And if the data is locked away in a box it can never be corrected or improved or enhanced – even though the technology to enable that is continually developing. So publishers lose out, in the end, too.
Are there alternatives to the black box?
The Text Creation Partnership is funded by a group of libraries led by Michigan. It’s transcribing content from major commercial page-image digitised collections. The resulting texts are restricted to partnership members and resource subscribers for 4 or 5 years and then released into the public domain.
The images continue to be behind the paywall, and not all texts are transcribed. For ECCO the proportion is small, but EEBO’s goal is much more ambitious: one transcription for each unique text (usually first editions). In January 2015, 25,000 phase 1 EEBO texts will become available to everyone for search and to download for textmining or whatever else we can think of by then. (Phase 2 in ?2019 will be something like another 40,000 texts.)
It surely is not beyond our wit to translate that kind of public-private collaborative model to crime records [suggestion here], and for that matter, other archival records with overlapping academic/family history user groups. But to do so, I think we need to build partnerships between historians, archivists and publishers much more than we’ve been doing. And if what you want is totally free-to-access resources, you still need to work with archivists to find answers to the ‘who pays?’ question. I hope today can be a good starting place.
But it’s not enough simply to think about institutional collaboration.
A lot of smart people are thinking very hard about ways to facilitate collaborative user participation in digital resources – transcription, indexing, correction, tagging, annotation, linking, and they’re building tools whose usefulness often isn’t confined to volunteer projects and ‘crowd sourcing’. (The re-use maxim applies here too: don’t build from scratch if other people have already done the hard work building and testing good tools.)**
However, don’t imagine ‘the crowd’ is an easy option. OBO has been trying it out for a while and we’re only sort of getting there.
Part of our problem, on reflection, has been adding these things near the end of a project when we launch a website and then hope for something to magically happen while we go off to the next project.
A second issue has been user interfaces and design. We took a while to learn that we have to make participation easy, really easy, and we have to build the design in from early on. It’s no good building something that needs a separate login from the rest of the site, with a flaky user database that was tacked on as an afterthought. Third, and related: understand the limits of what most users are willing to do.
Two years ago, we added to OBO a simple form for registered users to report errors: one click on the trial page itself. People have been using that form not simply to report errors in our transcriptions but to add information and tell us about mistakes in the original. The desire on the part of our site users to contribute what they know is there. Just don’t think that means there’s a ready-made ‘crowd’ waiting to turn up and help you out without plenty of effort on your part.
Historians and our dirty data
And what about us, the historians who have been or are working in the archives? We’re all digitizers now, and have been for a long time. Well, sort of. My computer has folders of databases, transcriptions and (to use the technical term) “stuff” that is kept from public view because, well, it’s a mess and I never get around to the data cleaning needed and it would be embarrassing to let people see my mistakes. I’m sure I’m not alone. [There were nods and sheepish grins all round at this point. You know who you are.]
Increasingly in future, there are going to be requirements from funders to share research data in institutional repositories and the like. We should not be assuming that means just scientists! We shouldn’t in any case be doing this just because someone demanded it; it should become a habit, the right thing to do to help each other.
But we need to get the right training in digital skills for students, so they know how to make good, shareable data, and how best to re-use data shared by others. (Full disclosure: I’m working on a project to create an online data management course for historians at the moment…)
Digitisation, digital history and re-usability don’t have to be all about big funded projects. It can start with personal decisions and actions: clean up your old data, put it in your institutional repository, share it with a Creative Commons licence, tell your colleagues and students it’s there. Relinquish some control. [more thoughts on this]
If we digitise for re-use, and re-use to digitise, we can share and collaborate, and build partnerships that can make some of the challenges of digitisation less intimidating. Digital history should be an iterative, accumulative, learning process rather than one-off ‘projects’ to be ‘launched’ and then left to gather dust.
* The findmypast resource has only gone up recently and the content isn’t complete yet. Keyword search functionality is apparently supposed to be included in the resource and it’s possible that will become available as it rolls out. But it should be noted that even a keyword search is unlikely to fulfil the needs that crime historians often have to crunch numbers in complex ways.
** The tools and projects in this slide really are a tiny sample of what’s happening now. They are:
Early Modern Resources is going to change. The site has been accumulating content for more than a decade now without changing significantly in its functions or intent. Meanwhile, the Web has expanded dramatically. There are now far more high-quality scholarly resources, especially collections of primary sources. But, just as important, there is also a much larger community of early modernists online.
As I began a very overdue review of links in the EMR database in April 2013, I soon began to feel that it needed something more drastic than another spring clean. Many of my older summaries were too short and unspecific to be at all helpful. Over time (unsurprisingly enough) my editorial decisions have been inconsistent (and occasionally mildly puzzling). Not all the early content had much substance, and some pages, though still accessible, were completely out of date. Conversely, however, some websites that I had linked when they were first established have grown far beyond my original summaries.
I’m not taking down Early Modern Resources “v.1” for the moment: it can be found at http://earlymodernweb.org/emr where it will be fully searchable, as at present (there might be the occasional broken internal link and, of course, a growing number of dead links to resources). However, I will not be doing any further updates to the site, and eventually it will go away altogether.
Instead, I’ve started building a new website, an ‘Early Modern Hub’, with a number of interconnected areas:
resources will be more tightly focused, emphasising resources for researchers (whether students, academics, independent scholars), especially online primary sources, and sites of scholarship. Hopefully, listings will provide more detailed and useful information, especially about large resources. I will continue to include only content that is free to access.
news and events – this will include the Early Modern News blog which is currently hosted at WordPress.com; it may also add links to external resources that I’ve been unable to include in EMR previously, such as blogs/sites set up specifically for conferences and short term projects.
blogs – I’d like to integrate Early Modern Commons more closely into the site; and think about ways to bring in Twitter links and conversations
people – I’m not quite sure yet what this might consist of, but I’d particularly like to facilitate a network to support postgraduate students, early career researchers, independent scholars and alt-academics. This might involve setting up a network along the lines of MLA Commons, if there’s enough demand for it. At the very least I’d like to have some kind of directory to which researchers would submit their own online profile. (Suggestions welcome.)
This will take a while to come together – watch out for news!
There are more than 200,000 pages of manuscript material from parish, criminal justice and hospital records, transcribed and marked up for searching in the same way as the Old Bailey Proceedings Online. Plus the 18th-century Proceedings and Ordinary’s Accounts and a group of additional datasets.
The emphasis is on searching for people (although there is also a keyword search) and on nominal record linkage, to facilitate writing the biographies of ordinary and extraordinary 18th-century Londoners.
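For anyone wondering what nominal record linkage looks like in practice, here is a deliberately crude Python sketch. It is not how London Lives does it: the records, the field names and the five-year window are all assumptions made up for illustration. It simply normalises names and pairs up records with the same name that fall close together in time, producing candidate links for a human to check.

```python
from collections import defaultdict


def name_key(name):
    """Crude normalisation: lower-case, drop punctuation, sort the name parts."""
    cleaned = "".join(c for c in name.lower() if c.isalpha() or c.isspace())
    return " ".join(sorted(cleaned.split()))


def candidate_links(records, max_gap=5):
    """Pair consecutive same-name records no more than max_gap years apart."""
    by_name = defaultdict(list)
    for rec in records:
        by_name[name_key(rec["name"])].append(rec)
    links = []
    for recs in by_name.values():
        recs.sort(key=lambda r: r["year"])
        for earlier, later in zip(recs, recs[1:]):
            if later["year"] - earlier["year"] <= max_gap:
                links.append((earlier["id"], later["id"]))
    return links


# Invented records standing in for entries from different document series.
RECORDS = [
    {"id": "a1", "name": "Mary Jones", "year": 1750},
    {"id": "b7", "name": "Jones, Mary", "year": 1752},
    {"id": "c3", "name": "Mary Jones", "year": 1790},
    {"id": "d4", "name": "John Smith", "year": 1751},
]

print(candidate_links(RECORDS))
```

A matching name and a plausible date range only suggest that two records might concern the same person; variant spellings and very common names mean the real work is in weighing the surrounding evidence.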