A Zotero resource, and bibliographies online – revisited


Earlier this week, I led a one day course on using Zotero at the British Library (part of their Digital Scholarship training programme for staff) – many thanks to James Baker for the invitation.

It was a very hands-on course, starting with the assumption that most people there would never have used Zotero before, and gradually building up in difficulty. We packed a lot into one day and the approach seemed to go down well.

James also generously agreed to me opening up the web resource I put together for the course (in PmWiki) for public consumption. It contains most of the exercises we worked through during the day – they are quite strongly BL-oriented, with plenty of my favourite topics (naturally…) but I think more generally applicable – as well as selected examples of the different kinds of things people and projects have done and are doing with Zotero – from teaching and group collaboration to research management, plugin development, publication, integration with other resources, and so on.

And so, here it is, under a Creative Commons license – use, re-use, mix, borrow and adapt if you’d find it useful!


Additionally, I found lots of interesting things while I was preparing the course, so I put them into a Zotero bibliography – well, what else?! – and made it into a public group, which Zotero users are very welcome to join and add to:

Managing Digital Research

I found myself answering the question “Why Zotero?” with some personal history, quite a bit of which was chronicled here on this blog over the years.  It occurred to me that I’ve been trying to manage references since my undergraduate dissertation more than 15 years ago, and I’ve been publishing bibliographies online for more than a decade (in the firm belief that it’s one of the most useful small things scholars can do for each other and for students). I’ve been through:

  • index cards (u/g and MA dissertations)
  • a homebrewed MS Access database (for my PhD secondary sources)
  • Endnote (for a while, but only because I got it cheap from my uni)
  • BibDesk (which I still use to some extent)
  • CiteULike
  • LibraryThing
  • Connotea, Mendeley, and probably other things of that ilk
  • Semantic Mediawiki (interesting but too much hard work)
  • wikindx (still in use, but probably phasing out soon)
  • Aigaion
  • and quite a few other things used so briefly I’ve forgotten them…

When I did my PhD research in the early 2000s, I put sources I wanted to quantify in an Access database; secondary references in another one; transcriptions in Word documents (slightly later, they ended up in a different text database); all separate objects, hard to relate to each other. Even though most of my PhD sources haven’t been digitised (and probably never will be), today with Zotero I would approach much of that task quite differently. OTOH, my interest in references in recent years has more often been to do with how to publish large bibliographies online and keep them up to date. Well, Zotero covers that too.

So, for me Zotero has won the contest, hands down. A few of the tools listed can perhaps do specific things better than Zotero, and most of them are just as free (several are open source), but none of them is as versatile and powerful while being so easy to use and to customise. (Wikindx, for example, is excellent, but you need to be able to install a MySQL database and really to understand a bit about PHP and web apps.)

Zotero provides much more than just “reference management”. It isn’t just that you can quickly save and archive lots of different kinds of things you find online, but also that you can use it to manage research as a process, with changing needs over time – right through from collecting sources to analysis and writing and publishing.

In 2009, when Zotero was in its infancy – before most of its cloud and collaboration features existed, or when they’d only just begun to develop – and I’d barely used it (just 42 items in my 3000+ library were added before 2010), I blogged about the impossibility of online collaborative bibliographies. Hahaha!

On Wednesday, I created a Zotero group live during the course (that took about a minute), and in the space of half an hour about six people, most of whom had never used Zotero at all before that day, put about 30 items in it, and added notes and attachments, ranging from news articles and reviews to YouTube videos. (At the other end of the scale, of course, there are Zotero groups creating major resources for their communities.)

Sometimes it’s great to be proved so completely and utterly wrong.

Even in that 2009 post, I see that I added a comment wondering if Zotero could be the solution to the problem. Maybe, too, the discussion we had about the decision to turn the RHS British and Irish History bibliography into a subscription service could look very different now.

Save Us From Carousels


I ranted on Twitter a while ago about the fad for auto-rotating carousels, sliders, changing images, and whatever, on homepages for academic and cultural sites. Quite a few people seemed to agree with me. Well, the nasty things have not gone away since then. Quite the opposite, it seems: every other digital project, research centre, or library collection appears to have decided that its homepage simply must have some bloody great flickering, twitching gizmo taking up a large chunk of the screen. (I haven’t looked, but I have dark suspicions that some of this infestation is down to WordPress plugins just making it too damn easy.)

Why am I on a homepage? Because I’m getting my bearings, especially if it’s my first visit. I want to know what the site contains of interest to me. And I want to do this quickly so I can get to the good stuff. I’m not going to wait for a carousel to go round, like it’s a TV screen, in the hope it might eventually display something useful to me. In fact, my first reaction on realising it’s one of those is generally “Arghh!! Scroll away NOW!!” So any utility it has is pure chance: if the very first panel it displays happens to be of interest to me, and stays there long enough for me to read it, I might click on it.

Naturally, I find it hard to believe most people don’t agree with me. But perhaps I’m wrong. Perhaps there’s loads of usability research backing up this design concept that says I’m the weirdo: most people love watching website carousels go round, find them useful entry points to a website and noooo, not distracting at all.

So I went looking around. First thing: you know what? Most developers hate them too. Second: there doesn’t actually seem to be very much empirical data, certainly not any scholarly research, though there’s plenty of anecdote. There are quite a few examples of developers and commercial UX-y people saying “yeah, we ran tests and people found them annoying”, but no numbers. I’ve found a few designers who love them because they look “cool” and “slick” and suchlike twattishness. I’ve yet to find one real website user with a good thing to say about them.

Still, what data there is says: most people don’t use them, lots of people don’t like them, and they can actually make it harder, not easier, for people to find useful information. Users tend to blank them out as irrelevant (“banner blindness”), but worse, they make it physically harder to focus on the information around them. Flickery moving things are distracting: whodathunkit?

Accordions and carousels should show a new panel only when users ask for it. Otherwise, it should stand still and let users read the information in peace, without having the rug yanked from under them. As our user said about Siemens’ big rotating box: “I didn’t have time to read it. It keeps flashing too quickly.”

While it’s obviously less annoying, I think a standard static carousel is pretty much useless, like a new version of Mystery Meat Navigation. I want to get information, not play a “Guess what’s next?!” game. I’m not going to use it. Still, at least I’m not going to swear at you while trying to make my escape as rapidly as possible. (Though I quite like ‘accordion’ style designs with text labels that open up. Having something to tell me what’s hiding under there makes all the difference.)

“Approximately 1% of visitors click on a feature [on a static carousel on one of the ND sites]… Of these clicks, 84% were on stories in position 1”. An auto-rotating carousel on another ND site did rather better: just under 9% of visitors clicked through, with the first feature shown averaging 40% and the rest ranging from 18% down to 11%. But those are still pretty small numbers for something that’s going to piss off a significant proportion of your site visitors, aren’t they?

Reason #1: Human eye reacts to movement (and will miss the important stuff)

Reason #2: Too many messages equals no message

Frost argues the real reason we get carousels is primarily political:

From universities to giant retailers, large organizations endure their fair share of politics. And boy does that homepage look like a juicy piece of prime real estate to a roomful of stakeholders. It’s hard to navigate these mini turf wars, so tools like carousels are used as appeasers to keep everyone from beating the shit out of each other.


A final thing, for people on academic projects planning websites. E-commerce sites and the like have plenty of money for regular website re-designs and refreshes. You won’t. If you don’t want your site to look tired and dated within months it’s in your interests to avoid fads and gimmicks on your homepage. And when it’s a fad that will irritate a substantial proportion of your site visitors, and be useless to nearly all of them, please JUST SAY NO.

Digital history blogs


I compiled a quick list very recently for someone who was looking for introductions to digital history and people doing digital history work. And having done it, I thought I might as well share it.

Firstly: in one respect, this is a broad tent – some of these people are strictly speaking in literature or historical linguistics. But the boundaries are fuzzy, and what they’re doing is relevant to historians’ research too.

Secondly: but in another, it’s a fairly narrow subset of digital historians who blog – people who are posting about digital tools and techniques that they’re using, things they’re building, practical hacks and code, reflections on the process and the results they’re getting from doing those things.

Thirdly: it was put together very quickly from my RSS feeds and Twitter favourites. Who am I missing? (Feel free to plug your own blog.) What group blogs should be included?

  • Bill Turkel – “computational history, big history, STS, physical computing, desktop fabrication and electronics”
  • Tim Sherratt – “digital historian, web developer and cultural data hacker”; Invisible Australians
  • Adam Crymble – large-scale textual analysis; 18-19th century London
  • John Levin – mapping and visualisation; 18th-century London
  • Jean Bauer – database design and development; late 18th/early 19th-century USA
  • Jason Heppler – hacking/scripting (Ruby evangelist); 20th century USA
  • Caleb McDaniel – hacking/scripting; American abolitionism
  • Chad Black – hacking/scripting; early Latin America
  • Lincoln Mullen – databases, R; religion in 18-19th-century America
  • Fred Gibbs – mapping, metadata, textmining; medieval/early modern medicine and science
  • Jeri Wieringa – textmining; American religious history
  • Ben Brumfield – crowdsourced transcription software (software developer, family historian)
  • Heather Froehlich – corpus linguistics; early modern drama and gender (lit/lang)
  • Ted Underwood – “applying machine learning algorithms to large digital collections” (lit)

Not recently active so I nearly forgot about them…

Because I really do have a terrible memory (sorry…)

More via Twitter (thanks @paige_roberts, @wynkenhimself)

From comments (thanks!)

Labels are a bit random, I know: just for a flavour of what people do. Tidying up might happen later.

Happy 10th Birthday WordPress


I’ve been using WordPress since 31 July 2004 (I wouldn’t remember myself, but the archives are there to tell me so), which was something like v1.5. It’s hard to express just how important it’s been to me during that time. With WordPress I first learned about MySQL databases; it gave me my first experiences of hacking PHP code; it was where I started properly using CSS. (Oh, and it also provided my first experience of having a website hacked. Hey, we live and learn.)

I moved this blog over to wordpress.com a couple of years ago (in part to stop me spending more time faffing around making it pretty than I spent actually writing on it), but that doesn’t mean I’ve stopped using self-hosted WP. Far from it. I’ve hand-coded websites from scratch, partly because I wanted to learn how to do it, partly because sometimes WP is overkill if you only want a simple site, but WP is still my go-to for setting up a more complex CMS.

This has all been possible not simply because WP is open source software, but also because from its earliest days it’s had great documentation that’s intelligible to people who don’t already know how to program, and it’s had the friendliest software support forums I have encountered anywhere. Bar none. The whole ethos, the community built around it, not just the code, has always been open.

Thank you for everything, WordPress, and here’s to the next 10 years.

Academic blogging: pleasure and credit


I was asked a question a few months ago about how we could go about giving academics more scholarly recognition and credit for blogging, and I realised how ambivalent I feel about this.

On the one hand, I would love to see quality blogging given the credit it deserves; I’d love to see young academics encouraged to blog, to network and support each other, and to engage with audiences that don’t just consist of other academics.

But on the other hand, it seems to me there’s a huge danger that blogging would simply be added on to the existing systems for awarding and measuring academic credit.

Imagine that for REF2026 (or for tenure at a US university), on top of all the conventional published ‘outputs’, academics must also submit a set number of research blog posts. And that most historians deal with this by copying & pasting the texts of their conference papers into a blog and hitting the Publish button a few times a year.

Of course, there are blogs already that mainly consist of that kind of material and it can make for a very good blog if the material is well chosen and written in the first place. And sharing good talks and presentations with wider audiences is a good thing to do.

But imagine the bland soul-numbing horror of hundreds or thousands of ‘blogs’ which exist purely to fulfil the requirements of a bureaucratic exercise and contain nothing but slabs of text that were dull when they were first read out to six people including the session panel, and will still be dull when they end up in their final article/monograph form to be read by reviewers and bored students in university libraries.

Just because some written online content uses blog software doesn’t make it blogging

And how much harder would it become to find the good academic blogging, where scholars want to communicate what they know and love, and where they engage and debate? How on earth would we persuade new academics that blogging is something you can do for enjoyment, if it becomes just another mandatory task?

Is compulsion and institutionalisation an inevitable outcome – the Satanic pact – of gaining scholarly credit in the corporate, bureaucratic academy?

To be honest, I’m not optimistic that there’s a way to gain the recognition that many academic bloggers have longed for without destroying what I believe is the real value of academic blogging, which is in many ways about pleasing yourself, escaping the targets and the quotas and the faceless bean-counters; about communicating and sharing through spontaneity and idiosyncratic self-expression. (So, I don’t blog for weeks at a time; and then I’ll write five posts in a weekend. Because I want to, and because I can, dammit.)

This personal self-indulgence doesn’t just happen to serve a wider, public purpose: it serves that purpose because it’s personal and indulgent and risky, because academic bloggers are willingly choosing to share the learning and understanding earned through those long, long hours in libraries and archives, and because they give something more of themselves than ‘mere’ knowledge.

History of Crime Blogs


Think of this as the “more hack, less yack” post. I’m putting together an aggregator for history of crime/justice/punishment blogging:

The New Newgate Calendar

I’ll do more later and add a form for people to submit more blogs and so on, but I wanted to get the basics up and running this weekend. (If it sounds at all familiar, it’s because there has been a page there with that name for a while – but this is New New!)

Collaboration and crowdsourcing for Old Bailey Online and London Lives


My digital crime history talk included some mention of ‘crowd sourcing’ and our stuttering efforts in this direction (on various projects) over the last five years or so. This post is intended as a marker to get down some further thoughts on the subject that I’ve been mulling over recently, to start to move towards more concrete proposals for action.

Two years ago, we added to OBO a simple form for registered users to report errors: one click on the trial page itself. People have been using that form not simply to report errors in our transcriptions but to add information and tell us about mistakes in the original. The desire on the part of our site users to contribute what they know is there.

We now have a (small but growing) database of these corrections and additions, which we rather failed to foresee and currently have no way of using. There is some good stuff! Examples:

t18431211-307    The surname of the defendents are incorrect. They should be Jacob Allwood and Joseph Allwood and not ALLGOOD

t18340220-125    The text gives an address Edden St-Regent St. I believe the correct street name is Heddon St, which crosses Regent St. There is no Edden St nearby. There is an Eden St and an Emden St nearby, but neither meet Regent St.

t18730113-138    The surname of the defendant in this case was Zacharoff, not Bacharoff as the original printed Proceedings show. He was the man later internationally notorious as Sir Basil Zaharoff, the great arms dealer and General Representative abroad of the Vickers armaments combine. See DNB for Zaharoff.

t18941022-797    Correct surname of prisoner is Crowder see Morning Post 25.10.1894. Charged with attempted murder  not murder see previous citation.

It also bothers me, I’d add, that there’s no way of providing any feedback (let alone encouragement or thanks). If I disagree with a proposed correction, I don’t have a way to let the person reporting the issue know that I’ve even looked at it, let alone explain my reasoning (someone suggested, for example, that the murder of a two-year-old child ought to be categorised as ‘infanticide’, but we use that term only for a specific form of newborn infant killing that was prosecuted under a particular statute during the period of the Proceedings).

On top of which, I think it’s going to become an increasing struggle to keep up even with straightforward transcription corrections because the method we’ve always used for doing this now has considerably more friction built in than the method for reporting problems!

So, the first set of problems includes:

  • finding ways to enable site users to post the information they have so that it can be added to the site in a useful way (not forgetting that this would create issues around security, spam, moderation, etc)
  • improving our own workflow for manual corrections to the data
  • solving a long-standing issue of what to do about names that were wrongly spelt by the reporters or have variant spellings and alternatives, which makes it hard for users to search for those people
  • maybe also some way of providing feedback

A possible solution, then, would be a browser-based collaborative interface (for both Old Bailey Online and London Lives), with the facility to view text against image and post contributions.

  • It should be multi-purpose, with differing permissions levels for project staff and registered users.
  • Corrections from users would have to be verified by admin staff, but this would still be much quicker and simpler than the current set-up.
  • But it would be able to do more than just corrections – there would be a way of adding comments/connections/annotations to trials or documents (and to individual people).

A rather different and more programmatic approach to (some of) the errors in the OBO texts than our individualised (ie, random…) and manual procedures was raised recently by Adam Crymble.

For such a large corpus, the OBO is remarkably accurate. The 51 million words in the set of records between 1674 and 1834 were transcribed entirely manually by two independent typists. The transcriptions of each typist was then compared and any discrepancies were corrected by a third person. Since it is unlikely that two independent professional typists would make the same mistakes, this process known as “double rekeying” ensures the accuracy of the finished text.

But typists do make mistakes, as do we all. How often? By my best guess, about once every 4,000 words, or about 15,000-20,000 total transcription errors across 51 million words. How do I know that, and what can we do about it?

… I ran each unique string of characters in the corpus through a series of four English language dictionaries containing roughly 80,000 words, as well as a list of 60,000 surnames known to be present in the London area by the mid-nineteenth century. Any word in neither of these lists has been put into a third list (which I’ve called the “unidentified list”). This unidentified list contains 43,000 unique “words” and I believe is the best place to look for transcription errors.

Adam notes that this is complicated by the fact that many of the ‘errors’ are not really errors; some are archaisms or foreign words that don’t appear in the dictionaries, and some (again) are typos in the original.

Certain types of error that he identified could potentially be addressed with an automated process, such as the notorious confusion of the long ‘S’ with ‘f’: “By writing a Python program that changed the letter F to an S and vise versa, I was able to check if making such a change created a word that was in fact an English word.”
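The letter-switching check Adam describes is simple enough to sketch. Something along these lines, though the tiny word list and all the names here are mine, purely to illustrate the idea:

```python
# A minimal sketch of the long-s check: swap each 'f' for 's' (and vice
# versa) and see whether that produces a recognised English word.
DICTIONARY = {"son", "sat", "same", "fun", "fat"}  # stand-in for real dictionaries


def swap_candidates(word):
    """Yield variants of `word` with a single f/s swapped."""
    for i, ch in enumerate(word):
        if ch in ("f", "s"):
            other = "s" if ch == "f" else "f"
            yield word[:i] + other + word[i + 1:]


def suggest_correction(word):
    """Return a dictionary word reachable by one f/s swap, else None."""
    if word in DICTIONARY:
        return None  # already a recognised word, leave it alone
    for variant in swap_candidates(word):
        if variant in DICTIONARY:
            return variant
    return None


print(suggest_correction("fon"))  # -> son ('f' misread for long 's')
```

Note that a swap can just as easily turn one real word into another real word, which is exactly why a fully automated pass is risky.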

But any entirely automated procedure would inevitably introduce some new errors, which we’re obviously reluctant to do (being pedantic historians and all that). So what to do?

Well, why not combine the power of the computer and the ‘crowd’? We could take Adam’s ‘unidentified list’ as a starting point, so that we’re narrowing down the scale of the task, and design around it a specific simplified and streamlined corrections process, within the framework of the main user interface. My initial thoughts are:

  • this interface would only show small snippets of trials (enough text around the problem word to give some context) and highlight the problem word itself alongside the page image (unfortunately, one thing we probably couldn’t do is to highlight the word in the page image itself)
  • it would provide simple buttons to check for a) the dictionary version or a letter switch, or b) the transcription is correct; with c) a text input field as a fallback if the correction needed is more complex; hopefully most answers would be a or b!
  • if at least two (maybe three?) people provide the same checkbox answer for a problem word it would be treated as a verified correction (though this could be overruled by a project admin), while text answers would have to go to admins for checking and verification in the same way as additions/corrections submitted in the main interface.
  • we should be able to group problems by different types to some extent (eg, so people who wanted to fix long S problems could focus on those)
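The verification rule in the third bullet could be sketched roughly like this; the answer format, threshold, and names are all my own invention, just to pin down the logic:

```python
from collections import Counter

REQUIRED_AGREEMENT = 2  # matching checkbox answers needed to auto-verify


def review_answers(answers):
    """Decide the status of one problem word from users' answers.

    `answers` is a list of (kind, value) pairs, where kind is
    'dictionary_or_swap' or 'correct_as_is' (the checkbox options),
    or 'free_text' (a typed correction).
    """
    checkbox = [a for a in answers if a[0] != "free_text"]
    free_text = [a for a in answers if a[0] == "free_text"]

    if checkbox:
        top_answer, count = Counter(checkbox).most_common(1)[0]
        if count >= REQUIRED_AGREEMENT:
            # Enough users agree: verified (an admin could still overrule).
            return ("verified",) + top_answer
    if free_text:
        # Complex corrections go to admins, as in the main interface.
        return ("needs_admin_review", None, None)
    return ("pending", None, None)
```

So two identical checkbox answers would verify a correction, while a lone answer just sits as pending until someone else confirms it.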

Suggestions from people who know more than me about both the computing and the crowdsourcing issues would be very welcome!

Unclean, unclean! What historians can do about sharing our messy research data


A second follow up to my digital crime history talk with (hopefully) some more practical notes and resources.

I’m as guilty as anyone of holding on to my old research data (databases, transcriptions, abstracts, calendars, etc of primary sources), so this is pulling together some stuff to prod me into action this summer. I have material going right back to my MA thesis that I keep not getting round to sharing. My resolution is that I’m going to try to do better this year. This post is partly to help me (and maybe you too?) to get there.

A couple of thoughts first. I think there are two slightly different meanings of ‘messy’ to be teased apart here.

One is about errors and gaps in the content of the material – we all make mistakes in transcription, especially when we’re learning the ropes; we leave “???” where we couldn’t quite make out what the source said; etc. I’m not talking about cleaning all of that up to spare your embarrassment. You know you’ll never get round to sorting that crap out (especially if it would require going back to the archives); so it’s time to try to get over your shame and be prepared to share it warts and all (and include appropriate caveats).

I’m thinking more in terms of tidying the structure, format, labelling and documentation of the data. So, for example, someone else can import it into a database without columns ending up all over the place, or with abbreviations and codes that no one but you can interpret.

  • Ensure it includes the archive references for the originals!
  • Add documentation that explains the dataset (and its limitations) accurately. If you’ve used codes in any fields on a spreadsheet or database, for convenience in data entry, make sure you add a list of what they actually stand for (or use Find & Replace to change them to something more transparent!).
  • Be clear about what it represents (eg full transcriptions, partial transcriptions, just summaries, etc)
  • Ideally, write the documentation itself in a structured data format to make it machine-readable (“metadata”; also, this intro).
  • Convert it to a non-proprietary format (eg .csv text files); or use one that has good inter-operability for people who don’t have the software you used to create it. Eg, Excel spreadsheets (.xls, .xlsx) can be opened in quite a lot of different packages and converted to other formats quite easily, but Access databases (.mdb) are much more difficult to share. Plain text files (.txt) are preferable to Word docs.

This last point isn’t just about data sharing, by the way, but also about preservation for your own use in the longer term. Do you want to find yourself unable to access all that hard work you did in the archives in just a few years because a manufacturer stopped making that software you used, or the version of it that you were using isn’t supported any more?
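To make those last couple of points concrete, here’s a minimal sketch of the conversion-plus-documentation step using Python’s standard csv and json modules. The records borrow the trial references from the corrections quoted earlier, but the field names, values, and file names are purely illustrative:

```python
import csv
import json

# Illustrative records only: refs taken from the examples above,
# everything else is made up for the sketch.
records = [
    {"ref": "t18431211-307", "surname": "Allwood", "note": "surname misprinted as Allgood"},
    {"ref": "t18941022-797", "surname": "Crowder", "note": "charge was attempted murder"},
]
fieldnames = ["ref", "surname", "note"]

# Plain-text CSV: readable in almost anything, no proprietary software needed.
with open("trials.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)

# A small machine-readable description of the dataset and its limitations.
metadata = {
    "title": "Sample trial corrections",
    "coverage": "summaries only, not full transcriptions",
    "fields": {
        "ref": "Old Bailey Online trial reference",
        "surname": "corrected defendant surname",
        "note": "free-text note on the correction",
    },
    "licence": "CC BY 4.0",
}
with open("trials.metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```

Even that much — plain CSV plus a little JSON file saying what the columns mean and what the data does and doesn’t cover — would put a dataset well ahead of most of what sits on our hard drives.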

And don’t forget, if you share data, you can get credit back too. (Make sure your documentation includes clear citation guidance…) Use a Creative Commons licence or something similar.

At the same time, this isn’t about trying to achieve perfection. (I don’t really advise going and plunging into the data management guidelines at the UK Data Service unless you have a lot of time on your hands…)

There are two resources I plan to try out (hey, I might even blog about the experience):

1. The Dataverse

A repository for research data that takes care of long term preservation and good archival practices, while researchers can share, keep control of and get recognition for their data.

(This may be aimed at social scientists, so I don’t know quite what it’ll be like to use as a historian. Only one way to find out…)

2. OpenRefine (was Google Refine)

a powerful tool for working with messy data, cleaning it, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.

The School of Data is something I’ve bookmarked recently that may also have something useful.

Right, who’s with me?