A second follow-up to my digital crime history talk with (hopefully) some more practical notes and resources.
I’m as guilty as anyone of holding on to my old research data (databases, transcriptions, abstracts, calendars, etc of primary sources), so this is pulling together some stuff to prod me into action this summer. I have material going right back to my MA thesis that I keep not getting round to sharing. My resolution is that I’m going to try to do better this year. This post is partly to help me (and maybe you too?) to get there.
A couple of thoughts first. I think there are two slightly different meanings of ‘messy’ to be teased apart here.
One is about errors and gaps in the content of the material – we all make mistakes in transcription, especially when we’re learning the ropes; we leave “???” where we couldn’t quite make out what the source said; etc. I’m not talking about cleaning all of that up to spare your embarrassment. You know you’ll never get round to sorting that crap out (especially if it would require going back to the archives); so it’s time to try to get over your shame and be prepared to share it warts and all (and include appropriate caveats).
I’m thinking more in terms of tidying the structure, format, labelling and documentation of the data. So, for example, someone else can import it into a database without columns ending up all over the place, or with abbreviations and codes that no one but you can interpret.
- Ensure it includes the archive references for the originals!
- Add documentation that explains the dataset (and its limitations) accurately. If you’ve used codes in any fields on a spreadsheet or database for convenience in data entry, make sure you add a list of what they actually stand for (or use Find & Replace to change them to something more transparent!).
- Be clear about what it represents (eg full transcriptions, partial transcriptions, just summaries, etc).
- Ideally, write the documentation itself in a structured data format to make it machine-readable (“metadata”; also, this intro).
- Convert it to a non-proprietary format (eg .csv text files), or use one that has good interoperability for people who don’t have the software you used to create it. Eg, Excel spreadsheets (.xls, .xlsx) can be opened in quite a lot of different packages and converted to other formats quite easily, but Access databases (.mdb) are much more difficult to share. Plain text files (.txt) are preferable to Word docs.
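To make the points above about expanding data-entry codes and exporting to plain text a bit more concrete, here’s a minimal Python sketch using only the standard library. The column names, shorthand codes and archive references are invented for illustration, not taken from any real dataset.

```python
import csv
import io

# Hypothetical shorthand codes used during data entry (replace with your own).
VERDICT_CODES = {"g": "guilty", "ng": "not guilty", "ig": "ignoramus"}

# A couple of invented rows standing in for your transcribed records.
rows = [
    {"archive_ref": "ASSI 45/4/1/131", "year": "1664", "verdict": "g"},
    {"archive_ref": "ASSI 45/4/1/135", "year": "1664", "verdict": "ng"},
]

# Expand the codes so the values are self-explanatory to other people.
for row in rows:
    row["verdict"] = VERDICT_CODES.get(row["verdict"], row["verdict"])

# Write plain-text CSV; in practice you'd open a file rather than a StringIO.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["archive_ref", "year", "verdict"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```

This is the same effect as Find & Replace, but repeatable, and the code list is documented in the script itself.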
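As for machine-readable documentation, one low-effort option is a small JSON file sitting alongside the data. The field names and values below are entirely invented for illustration – a proper standard such as Dublin Core would be the more rigorous choice.

```python
import json

# Invented example metadata; the field names are ad hoc, not a formal standard.
metadata = {
    "title": "Assize depositions sample, 1660-1700",
    "creator": "Your Name",
    "description": "Partial transcriptions of witness depositions.",
    "source": "The National Archives, ASSI 45",
    "known_gaps": "Illegible words marked '???'; some years not yet entered.",
    "licence": "CC BY 4.0",
}

# Save it next to the dataset so the two travel together.
with open("metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```

Even something this simple beats a stray "notes.doc" that only makes sense to you.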
This last point isn’t just about data sharing, by the way, but also about preservation for your own use in the longer term. Do you want to find yourself unable to access all that hard work you did in the archives in just a few years, because the manufacturer stopped making the software you used or no longer supports the version you were using?
And don’t forget, if you share data, you can get credit back too. (Make sure your documentation includes clear citation guidance…) Use a Creative Commons licence or something similar.
At the same time, this isn’t about trying to achieve perfection. (I don’t really advise going and plunging into the data management guidelines at the UK Data Service unless you have a lot of time on your hands…)
There are two resources I plan to try out (hey, I might even blog about the experience):
1. A repository for research data that takes care of long-term preservation and good archival practices, while researchers can share, keep control of and get recognition for their data.
(This may be aimed at social scientists, so I don’t know quite what it’ll be like to use as a historian. Only one way to find out…)
2. OpenRefine (formerly Google Refine)
a powerful tool for working with messy data, cleaning it, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.
The School of Data is something I’ve bookmarked recently that may also have something useful.
Right, who’s with me?
5 thoughts on “Unclean, unclean! What historians can do about sharing our messy research data”
OpenRefine has been my new best friend for a few months now. Tony Hirst has some nice protips on his blog: http://blog.ouseful.info/?s=google+refine
There are non-technical reasons that make release of raw research material a challenge.
We spent some time talking about this at the “Show Me Your Data: Scholarly Notes/Public Expectations” session (proposal, notes) at THATCamp Leadership two weeks ago. The problems that we’re running into are not with archiving a dataset per se, but rather with repurposing it for public use.
There are all kinds of perils here for the scholar involved, since citation practices that were appropriate for a book don’t work for a searchable website, as well as for the public, since transforming manuscript census entries into a database sortable and searchable by attribute allows anyone to easily pull up lists of people recorded as “whore”, “negro trader”, illiterate or insane.
I’ve amended the raw session notes with more details about our discussion after reading this post.