This will be a post in two parts about data relating to the series of Westminster Coroner’s Inquests on London Lives, which cover the period 1760-1799. …
Posted at In Her Mind’s Eye
This post takes a look at an open dataset available through the University of Pennsylvania’s open access repository. The dataset, Indentures and Apprentices made by Philadelphia Overseers of the Poor, 1751-1799 (created by Billy G. Smith), is one of an interesting collection of datasets on 18th- and 19th-century history which I may return to in the future. …
If you know me, the topic of this first post may come as unsurprising but also a bit eyebrow-raising. “Sharon, you’ve been working on the Old Bailey Online project (OBO) since forever. Aren’t you bored with it yet?” …
My final data visualisation post for this Women’s History Month is back in the 18th century and takes a look at an open dataset, Vagrant Lives: 14,789 Vagrants Processed by the County of Middlesex, 1777–1786, which was created by Adam Crymble, Louise Falcini and Tim Hitchcock, using data from London Lives.
This dataset makes accessible the uniquely comprehensive records of vagrant removal from, through, and back to Middlesex, encompassing the details of some 14,789 removals (either forcibly or voluntarily) of people as vagrants between 1777 and 1786. It includes people ejected from London as vagrants, and those sent back to London from counties beyond
They’ve already written about this data in an excellent article (open access) and Crymble has blogged further about his ongoing research. (They have better visualisations too, so you could skip this post entirely and go to the real thing. Think of this as a taster.)
I want to focus on ways of visualising multiple categories of qualitative information – the more categories you want to compare at the same time, the more complex a dataviz has to be. In this case, I’ve got four categories to play with: gender, dates, countries of origin, and vagrant ‘types’. That’s to say, there are three types of individual in the dataset: leaders of family groups, their dependents, and single vagrants. The gender of the majority of dependents is unknown (most are children), so for most of this post, I decided to simplify things by filtering out all of the dependents to focus on the group leaders and singles. (As a result, because I’m ignoring about 500 wives who were counted as dependents, the following will differ somewhat from the work referenced above.) This resulted in 10963 individuals.
Overall, the gender ratio of the vagrants looks almost perfectly balanced (5438 female to 5525 male). But this hides some interesting variations.
Firstly let’s break it down by the year of the case. (There are some missing records, and the very small numbers in 1777 and 1779 in particular are due to these gaps.) Two things stand out: the numbers of both female and male vagrants rise rapidly in the mid-1780s; and women are in the majority each year until 1782, after which they’re overtaken by men.
Now looking at vagrant type. As soon as you have multiple categories, you can split up the data in different ways – the “best” can depend on the data and exactly what it is you want to show. So graph 3a compares the percentages of male and female vagrants for each vagrant type, whereas graph 3b shows the percentages of group and single for each gender. 3b highlights that the majority were single individuals – something you wouldn’t know at all from 3a. It also makes it clear that vagrant type was gendered – considerably more men than women were singles. 3a, on the other hand, is better if you want to know exactly what the proportions of men and women were in each type. Most often, if I had to pick just one of these, it’s likely that I’d plump for 3b, because I’ve already seen that overall there are very similar numbers of men and women. But it might be a harder choice if that weren’t the case.
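The difference between 3a and 3b is just a question of which total you divide by. Here's a minimal Python sketch of that idea (the charts in this post were not made this way, and the counts below are invented toy values, not the real Vagrant Lives figures; `percentages_within` is a hypothetical helper name).

```python
from collections import Counter

# Invented toy records (gender, vagrant type); NOT the real dataset counts.
records = (
    [("F", "group")] * 3 + [("F", "single")] * 5
    + [("M", "group")] * 1 + [("M", "single")] * 7
)
counts = Counter(records)


def percentages_within(counts, by):
    """Percentage of each (gender, type) cell within the totals of one dimension.

    by=0 divides by the gender totals (the 3b view: group vs single per gender).
    by=1 divides by the type totals (the 3a view: female vs male per type).
    """
    totals = Counter()
    for key, n in counts.items():
        totals[key[by]] += n
    return {key: 100 * n / totals[key[by]] for key, n in counts.items()}


within_gender = percentages_within(counts, by=0)
within_type = percentages_within(counts, by=1)

print(within_gender[("M", "single")])  # share of men who are single
print(within_type[("M", "single")])    # share of singles who are men
```

The same counts give two quite different pictures, which is why the choice depends on the question you're asking.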
Now, looking at country of origin (British and Irish vagrants only, as there were only a few from other countries), further striking differences emerge. It’s hardly surprising that the majority of the vagrants came from England, but much more noteworthy that there was such a large disparity between Irish men and women.
Adam Crymble discusses what’s most likely going on, and it ties in with the particularly rapid increase in the numbers of male vagrants from 1783 shown in graph 1 – it’s probably the result of demobilisation after the American wars.
This says ‘demobilisation’ to me, and the male nature of most Irish vagrants suggests that this may have been a strategy for getting home after the war. Demobilisation was heavily centralized in London. Soldiers and sailors weren’t taken home; they were dropped off and left to find their own way.
Finally, I want to visualise the relationships between three categories in the data: gender, country and vagrant type. Mosaic plots are a more complex and less commonly used type of visualisation that can cram a lot more information into a single chart than you can with a bar chart. But, as with boxplots, that makes them a bit harder to interpret.
Imagine that you start with a single large rectangular block. For your first category, you divide it horizontally, and put the labels for each “level” (in this case there are two, F and M, for gender) on the left hand Y axis. As in the very first bar chart, we can see that the proportions of men and women are close to equal.
Then you sub-divide the two blocks vertically for your second category (country) and put the labels along the top X axis. So reading left to right along each gender block, the first vertical block = English, the second = Irish, third = Scottish and fourth = Welsh. Again, we can see that English vagrants are in the majority for both genders, and at the same time, how a much higher proportion of the men are Irish.
Finally you sub-divide the blocks once again, horizontally, for the third category (vagrant type), and the labels for these (group and single) go on the right hand Y axis. The biggest single category, then, is women from England who are single (Hitchcock et al argue that short-distance female migration to London to find domestic service accounts for much of this). The smallest category is men from Wales who lead a group.
Male Irish and Welsh vagrants are more likely to be single than are men from England and Scotland, whereas a higher proportion of Irish and (even more so) Scottish women were heading groups. (Crymble has also emphasised how different the Irish and Scottish vagrants were.)
The use of colour and shading adds one final dimension, but it’s harder to interpret on first sight. The idea is to show statistical significance. What it boils down to is that blue means the square is bigger than would be expected by the statistical model; red means it’s smaller than the model would expect (and the darker the colour, the bigger the significance). The fact that the group-Irish-male box is coloured dark red (ie, smaller than “expected”) pretty much seems to reinforce what we’ve already observed. The group-Scottish-female box also stands out among the smaller blocks – suggesting that this is significant and might be further investigated.
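To make "bigger or smaller than the model would expect" a bit more concrete, here's a rough Python sketch. It assumes the model is simple independence between the categories and that the shading is driven by Pearson residuals, which is a common choice for shaded mosaic plots but is my assumption here, not something confirmed for the chart above. It's also cut down to a two-way table with invented toy counts.

```python
import math

# Invented toy counts for a two-way table (gender x vagrant type).
observed = {
    ("F", "group"): 30, ("F", "single"): 70,
    ("M", "group"): 10, ("M", "single"): 90,
}


def pearson_residuals(observed):
    """Return (observed - expected) / sqrt(expected) for every cell,
    where 'expected' assumes the two categories are independent."""
    total = sum(observed.values())
    row_totals, col_totals = {}, {}
    for (row, col), n in observed.items():
        row_totals[row] = row_totals.get(row, 0) + n
        col_totals[col] = col_totals.get(col, 0) + n
    residuals = {}
    for (row, col), n in observed.items():
        expected = row_totals[row] * col_totals[col] / total
        residuals[(row, col)] = (n - expected) / math.sqrt(expected)
    return residuals


residuals = pearson_residuals(observed)
for cell, value in residuals.items():
    # Positive = more cases than independence predicts (a "blue" cell);
    # negative = fewer (a "red" cell).
    print(cell, round(value, 2))
```

A large residual only says the cell departs from what independence would predict; whether that reflects the past or the record-keeping is a separate question.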
However, it’s important to understand whether what the statistical model “expects” is appropriate for the data we have. In medical research, where data collection is conducted according to carefully defined rules, it may be possible to be confident that a statistical significance means a “real” difference. For a historian it might simply be pointing to imperfections in the data! So it’s essential for historians doing data analysis and visualisation to get to grips with both the original sources and the statistics. I’m still grappling with the second part…
More about Mosaic plots and their interpretation:
I note that the website for Civil War Petitions: Conflict, Welfare and Memory during and after the English Civil Wars, 1642 – 1710 is up, with the first batch of petitions (I think) due later this year. And there are still a few days of Women’s History Month to run, so I thought it might be opportune to post a 1665 petition from a soldier’s widow from my old Denbighshire Quarter Sessions files.
The petition of Geelien Cowley ‘a poore widdow and mother of three smale fatherlesse children’:
that your petitioners late husband by name E[dward] Birien of Ruthin a souldier that served in his majestys service in Ireland neare upon three yeares & afterward he retorned to England he served in his majestys service there sixe or seaven yeares where in all these tymes he suffered many ympriso[nments] wounds & brueses wch made him unable to earn his liveliehoode & more especiallie this two yeares last past then he was allowed one of the majestys pensioners to receave a share of his majestys allo[wance] for maymed souldiers provided. Nowe may it please [your] worships to be advertised that the said Edward Birien your petitioners late husband, had a longe sicknesse, beeinge vearie poore & nowe called to gods mercie caused your petitioner to goe upon the credit with her neighbours to suplie her said husbands wants in confidence to receave his share & alloweance of pension as afore is set forth, but it was gods will to take hime to his mercie afore this generall sessions.
Most humbly prayeinge your worships to allowe your petitioner the pencion allotted her late husband for to paye to her creditors what she is engaged for & your worships further help & succours in such sort as your worships thinke meete without your worships comisseracion hearein your petitioner shall not be able to goe amonge good & charitable people for releefe to her & her smale children for feare of arrest or lawsuite. this I humblie bege for gods sacke…
The treasurer of the maimed soldiers’ fund was ordered to pay her the whole quarterly allowance due to her husband.
[NLW Chirk Castle Quarter Sessions files October 1665 B21/d7]
I’ve recently been working on the Digital Panopticon, a digital history project that has brought together (and created) massive amounts of data about British prisoners and convicts in the long 19th century, including several datasets which include heights for women. Adult height is strongly influenced by environmental factors in childhood, one of the most important being nutrition. So,
The height of past populations can thus tell historians much about the conditions that individuals encountered in their formative years. Given sufficient data it is possible to glimpse inside households in order to piece together a history of the impact that declining wages, rising prices, improvements in sanitation and diminishing family size had on mean adult stature.
However, many studies of height and nutrition in 18th- and 19th-century Britain focused on military records and therefore had little to say about women. The turn to using the rich records of heights for men and women (and children) in 19th-century penal records has been more recent.
Today’s post is going to look at height patterns in four Digital Panopticon datasets, mainly using a kind of visualisation that many historians aren’t familiar with: box plots. If you’ve seen them and not really understood them, it’s OK – I didn’t have a clue until quite recently either! And so, I’ll start by attempting to explain what I learned before I move on to the actual data.
A box plot, or box and whisker plot, is a really concentrated way of visualising what statisticians call the “five-number summary” of a dataset: 1. the median average; 2. upper quartile (the value below which three-quarters of the data fall, ie the median of the upper half); 3. lower quartile (the value below which one-quarter of the data fall, ie the median of the lower half); 4. minimum value; and 5. maximum value.
Here’s a diagram:
The thick green middle bar marks the median value. The two blue lines parallel to that (aka “hinges”) show the upper and lower quartiles. The pink horizontal lines extending from the box are the whiskers. In this version of a box plot, the whiskers don’t necessarily extend right to the minimum and maximum values. Instead, they’re calculated to exclude outliers which are then plotted as individual dots beyond the end of the whiskers.
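For the curious, here's a small Python sketch of how whiskers and outliers can be worked out. It assumes the usual "1.5 times the interquartile range" rule (the default in ggplot2's box plots, though the exact quartile method may differ slightly from the one used here). The heights are invented toy values and `whiskers_and_outliers` is a hypothetical helper name.

```python
import statistics


def whiskers_and_outliers(values, k=1.5):
    """Return (lower whisker, upper whisker, outliers) using the k * IQR rule."""
    data = sorted(values)
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers stop at the most extreme values still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return inside[0], inside[-1], outliers


# Invented toy heights in inches, with one very short and one very tall value.
heights = [48, 58, 59, 60, 60, 61, 61, 62, 62, 63, 64, 75]
low, high, outliers = whiskers_and_outliers(heights)
print(low, high, outliers)
```

Here the two extreme values fall outside the fences, so they would be drawn as separate dots rather than stretching the whiskers.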
So what’s the point of all this? Imagine two datasets: one contains the values 4,4,4,4,4,4,4,4 and the other 1,3,3,4,4,4,6,7. The two datasets have the same averages, but the distribution of the values is very different. A boxplot is useful for looking more closely at such variations within a dataset, or for comparing different datasets, which might look pretty much the same if you only considered averages.
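Those two example datasets are easy to check for yourself. This is a minimal Python sketch using the standard library; the "inclusive" quartile method is just one of several conventions, so other software may give slightly different quartiles.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])


flat = [4, 4, 4, 4, 4, 4, 4, 4]
spread = [1, 3, 3, 4, 4, 4, 6, 7]

# Both datasets have a mean of 4 and a median of 4...
print(statistics.mean(flat), statistics.mean(spread))

# ...but the five-number summaries show how differently they are spread.
print(five_number_summary(flat))
print(five_number_summary(spread))
```

The first dataset collapses to a single line at 4, while the second has a box running from 3 to 4.5 and whiskers out to 1 and 7.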
These are the four datasets:
For each dataset, I only included women who had a year of birth, or whose year of birth could be calculated using an age and date, as well as a height. (I say “heights” above because I can’t guarantee that they are all unique individuals; but nearly all of them should be.) In all the following charts I’m including only adult women aged over 19.
Here’s what happens when you plot the heights for each birth decade in RHC.
(This is generated using the R package ggplot2, and it looks a little bit different from many examples you’ll see online because ggplot has a nice feature to vary the width of the boxes according to the size of the data group.)
The first thing I look for is incongruities that might suggest problems with the data, and on the whole it looks good – the boxes are mostly quite symmetrical and none of the outliers is outside the realms of possibility (the tallest woman is 74.5 inches, or 6 foot 2 1/2, and the shortest is 48 inches), though I’m slightly doubtful that there were women born in the 1800s in this dataset, which gets going in the 1880s; still, they’re a very small number so unlikely to skew things much overall. Since the data seems to be OK on first sight, the interesting thing to note here is that from the 1850s onwards, the women are getting taller, and those born in the 1890s are quite a lot taller than the 1880s cohort. This is fairly consistent with Deb Oxley’s (more fine-grained) observations of the same data.
Again, we have a reasonable spread of heights and fortunately a very small number of slightly questionable early births. (It happens to be the case that this data was manually transcribed, whereas RHC was created using Optical Character Recognition – but on the other hand, the source for RHC was printed and much more legible than the handwritten indents.) Ignoring for now the very small groups before the 1770s, the tallest decade cohort of women in this data is those born in the 1790s and thereafter they get consistently shorter.
Let’s put all four datasets together! (click on the image for a larger version)
I’ve filtered out women born before 1750 and after 1899, because the numbers were very small, and some extreme outliers (more about those later…). Then I added a guideline at the median for the 1820s (the mid-point), as I think it helps in seeing the trends.
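The filtering and grouping steps might look something like this in Python. It's only a sketch: the real charts were made in R, the records below are invented toy values, and `median_height_by_decade` is a hypothetical helper name.

```python
import statistics
from collections import defaultdict

# Invented toy records: (year of birth, height in inches).
records = [
    (1745, 61.0), (1793, 62.0), (1797, 61.5), (1824, 60.5),
    (1826, 61.0), (1829, 60.0), (1891, 62.5), (1902, 63.0),
]


def median_height_by_decade(records, first=1750, last=1899):
    """Drop births outside first..last, then take the median height per birth decade."""
    by_decade = defaultdict(list)
    for year, height in records:
        if first <= year <= last:
            by_decade[year // 10 * 10].append(height)
    return {decade: statistics.median(h) for decade, h in sorted(by_decade.items())}


medians = median_height_by_decade(records)
guideline = medians[1820]  # reference line at the 1820s median
print(medians, guideline)
```

Integer division by 10 is what turns a birth year like 1824 into its decade, 1820.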
It might seem surprising at first that the late 18th-century women of HCR are taller than any subsequent cohorts until the 1890s. Yet the trends here are broadly consistent with the pioneering research by Roderick Floud et al on British men and boys between 1740 and 1914. They argued “that the average heights of successive birth cohorts of British males increased between 1740 and 1840, fell back between 1840 and 1850, and increased once again from the 1850s onwards” (Harris, ‘Health, Height and History’). The British population was less well-fed for much of the 19th century (as food resources struggled to keep up with rapid population growth), and it got smaller as a result. Our women’s growth after 1850 may be slower than for the men (until the 1890s) though; perhaps it took longer for women than men to start growing again.
Finally, though, I have to put in a big caveat about the HCR data. I mentioned that I excluded some extreme outliers from the chart above. HCR was by far the worst offender, and if you look closely at the 18th-century cohorts covered by HCR, the boxes aren’t quite as symmetrical as the 19th-century ones. If we visualise it using a histogram (another handy one for examining the distribution of values in a dataset), we can see more clearly that there’s something up. A ‘normal’ height distribution in a population should look like a “bell curve” – quite tightly and symmetrically clustered around the average. CIN and RHC are close:
But this is what HCR looks like. This is not good.
If we’re lucky, much of the problem could turn out to be errors in the data which can be fixed. After all, it’s at least roughly the right kind of shape! The big spike at 60 inches (5 feet) rings plenty of alarm bells though. It looks reminiscent of a problem we have with much of the age data in the Digital Panopticon, known as “heaping”, a tendency to round ages to the nearest 0 or 5 (people often didn’t know their exact dates of birth). The age heaping is very mild in comparison to this spike, so I think it could well be another issue with either the transcription or the method used to extract heights. But if it turns out that’s not the case, this could be pretty problematic. We’re assuming the prisoners were properly measured, but we don’t know anything about the equipment used. For all we know, it might often have been largely guesswork. In the end, we might find that HCR simply isn’t reliable enough to use for demographic analysis. There’s very little height data for women born in the 18th century, so this is a potentially really important source. But what if it’s not up to the job?
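One standard way demographers put a number on age heaping is Whipple's index. It isn't something I've used in this post; the Python sketch below is just to show the idea, and the ages are invented toy values. It compares how many ages between 23 and 62 end in 0 or 5 with the one-in-five you'd expect if there were no rounding: 100 means no heaping, 500 means every age is rounded.

```python
def whipple_index(ages, low=23, high=62):
    """Whipple's index for ages in low..high: 100 = no heaping, 500 = all heaped."""
    in_range = [age for age in ages if low <= age <= high]
    if not in_range:
        raise ValueError("no ages in the required range")
    heaped = sum(1 for age in in_range if age % 5 == 0)
    return 100 * 5 * heaped / len(in_range)


# Invented toy data: one person at every age, versus everyone rounded.
even_ages = list(range(23, 63))
rounded_ages = [30, 35, 40] * 10

print(whipple_index(even_ages))
print(whipple_index(rounded_ages))
```

A similar count of values sitting exactly on round numbers could be tried on the heights, although a single spike at 60 inches is a rather different pattern from general rounding.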
John Canning, Statistics for the Humanities (2014), especially chapter 3.
H Maxwell-Stewart, K Inwood and M Cracknell, ‘Height, Crime and Colonial History’, Law, Crime and History (2015).
Deborah Oxley, David Meredith, and Sara Horrell, ‘Anthropometric measures of living standards and gender inequality in nineteenth-century Britain’, Local Population Studies, 2007.
Bernard Harris, ‘Health, Height, and History: An Overview of Recent Developments in Anthropometric History’, Social History of Medicine (1994).
Jessica M. Perkins et al, ‘Adult height, nutrition, and population health’, Nutrition Reviews (2016).
Today I want to go on an excursion in “catalogues as data”. The UK National Archives’ Discovery catalogue is an excellent resource for this activity, because a) it has a lot of records that have document descriptions at ‘item’ or ‘piece’ level in the catalogue, containing quite structured information (like dates, places, occupations) that can be quantified and visualised; and b) even more importantly, it has an export function that allows you to download up to 10,000 records in CSV format. (It also has a full API for those with some programming skills, but 10,000 records will get you a long way, and you can often break up larger collections into chunks, eg with date filters).
You’ll need to use the Discovery advanced search quite carefully to get the right set of search results (it enables specification of particular records, dates, catalogue level, etc) – there are some useful tips here. Then you’re quite likely to need to use a tool like OpenRefine to separate out pieces of information into separate data fields and clean/normalise dates etc (check out this tutorial).
the service records of more than 7,000 women who joined the Women’s Army Auxiliary Corps (WAAC) between 1917 and 1920… The WAAC became the QMAAC in April 1918 and was disbanded in September 1921
At 7000 records, this sounded like a good size set to play around with, well within the download limits. And a look at a catalogue entry showed that it has some nice information beyond women’s names (unlike a similar and larger series, WO399, which has only transcribed names). Given just a few hours work extracting and cleaning the data, what could I learn?
|Record for|Aaron, Sarah Ann nee Phillips|
|Place of Birth:|High Street Cefn Mawr, North Wales|
|Date of Birth:|22 August 1894|
First, what does this actually offer in terms of usable data? The date of birth is an obvious one: closer inspection shows that it’s in a consistent format where there’s a full date (the majority); at least a year is provided in almost every case, and that can be extracted into a standard year of birth field quite easily. Place of birth also has potential, but it’s more varied and needs more cleaning, so I haven’t done anything with that yet; but it could make for an interesting mapping exercise. Less obviously perhaps, “nee Phillips” suggests that – if you can safely assume women always gave this information! – it’s possible to also infer something about whether a woman was (or had been) married. Another nice little thing you could also potentially do, given birth dates and first names, is to look for patterns in baby naming (although this might really need a larger dataset).
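I did this sort of extraction in OpenRefine, but the same logic can be sketched in a few lines of Python. The patterns are assumptions based on the example record above (a four-digit year in the 1800s or 1900s, and a maiden name following "nee" at the end of the name), and `birth_year` and `maiden_name` are hypothetical helper names.

```python
import re

# A four-digit year starting 18 or 19, anywhere in the date text.
YEAR = re.compile(r"\b(1[89]\d{2})\b")
# "nee" (or "née") as a separate word, followed by the maiden name.
NEE = re.compile(r"\bn[eé]e\s+(.+)$", re.IGNORECASE)


def birth_year(date_text):
    """Pull a year out of a full or partial date; None if there isn't one."""
    match = YEAR.search(date_text)
    return int(match.group(1)) if match else None


def maiden_name(name_text):
    """Return the name after 'nee', or None if no maiden name is recorded."""
    match = NEE.search(name_text)
    return match.group(1).strip() if match else None


print(birth_year("22 August 1894"))
print(maiden_name("Aaron, Sarah Ann nee Phillips"))
print(maiden_name("Smith, Renee"))
```

Note that a missing maiden name only tells you nothing was recorded, which is why the inference about marriage depends on women always having given this information.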
Two caveats, one major and one more minor:
Visualisation is often particularly useful for highlighting errors and problems in your data. But it’s the researcher who has to decide what to do about such anomalies (and whether they might even be serious enough to make the whole dataset too unreliable to be worth using).
I initially hoped that the record dates would represent specific dates when women joined up, but as it turned out there was only a covering date for the series as a whole. Since it only covers 4 years, that’s not really an issue; instead I simply worked out their ages in 1918 (assuming that there wouldn’t have been new recruits after the war ended anyway), and filtered out the half-dozen supposedly born before 1860 or after 1903.
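In code, that step is very simple. Here's a Python sketch with invented toy birth years (not the real records); `plausible` is a hypothetical helper name and the cut-off years are the ones mentioned above.

```python
def plausible(birth_year, earliest=1860, latest=1903):
    """Keep only birth years inside the range treated as believable."""
    return earliest <= birth_year <= latest


# Invented toy birth years, including two that should be filtered out.
birth_years = [1850, 1885, 1893, 1894, 1898, 1900, 1905]

# Age in 1918 for every woman whose birth year passes the filter.
ages = [1918 - year for year in birth_years if plausible(year)]

# Percentage of the remaining women aged 25 or under in 1918.
share_25_or_under = 100 * sum(age <= 25 for age in ages) / len(ages)

print(ages)
print(share_25_or_under)
```

Because only a year of birth is used, each age could be out by a year depending on the month, which is fine for this kind of broad picture.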
And so the thing I learned today is that, gosh, they were so young.
As visualisations, tables may be less eye-catching than graphs, but they have the virtue of presenting a lot of precise information in a relatively small space; the table at the bottom of this post shows that more than 60% of the women were aged 25 or under in 1918 and about 90% were under 30. Very few of them were old enough to take advantage of the limited extension of voting rights to women at the end of the war.
This is confirmed by a bit of background reading – according to Lucy Noakes on Women’s Mobilization for War (Great Britain and Ireland), “the majority of recruits to the WAAC were young working class women”. If we can reasonably assume that the information given about maiden names is a complete record, or anywhere near it, the vast majority of the women were also unmarried – nearly 95% of them overall. I suspect that very few married women would have volunteered for this type of service (which was likely to take them overseas and close to combat), and as a result it might be expected that the majority would be young – very likely younger, on average, than male soldiers. You can also see that a considerably higher proportion of the women aged over 25 were/had been married – but it still looks a very low proportion compared to what you might expect in the general population (and I wonder if quite a lot of these were widows).
I’m not exactly surprised to learn from Noakes that their youth (and, no doubt, class) resulted in some negative perceptions:
In the public mind however, they were sometimes perceived as thrill seekers, drawn by a desire for adventure and romance, and recruitment to the service suffered from fears that women were finding opportunities for sexual liaisons with the soldiers. So worried was the government by these rumours that a Commission of Enquiry was formed, which included figures showing the number of pregnancies amongst unmarried members of the WAAC was lower than among unmarried civilians…
The ages of women recruited to the WAAC/QMAAC, 1917-18
|Age in 1918|number of women|% of total|
The second instalment in this series of data visualisation posts for Women’s History Month 2018 looks at the World Bank World Development Indicators (WDI). This massive collection has data in several categories: demographic, education, work, poverty, health. It includes both country-level data and various aggregates by different criteria: geographical regions, income levels, etc. The UK Data Service has a useful guide as well as access to the data. You can also download it directly from the World Bank website (and it has an API which I haven’t tried), and there are tools like R packages.
A lot of the data is relevant to women’s and gender history, so much so that gender has its own data portal. I’ve selected just a handful of significant indicators with the most comprehensive coverage (life expectancy, fertility, education), and I’ve done two series of graphs: the first uses the World Bank income level groupings, and the second takes a selection of six countries (chosen because they are varied geographically, culturally, in terms of income and in their data patterns, and because they have good data coverage, not for any kind of representativeness).
I’m sure there are no surprises here for people who study global development, but for me at least it’s been an educational experience. There seems to be quite a lot of good news for women in this data. The bad news is the sheer levels of inequality between income regions in many of the indicators.
Life expectancy is one of the most long-running series in the data; most countries have it from 1960 onwards. This is a ‘faceted’ graph comparing female and male life expectancy at birth in the five income groups (and the world as a whole). The familiar observation that women live longer than men is not just a “Western” phenomenon, although it appears that the wealthier the country, the bigger the gap. The level of the continuing gap between the richest and poorest countries is one thing that has not changed much.
The second graph uses the same format to look at the six countries, and at country level there is more variation – even dips or periods of stagnation that counter the general upward trends.
(Oops. I forgot to make nice labels for the y axes. That should be “years from birth”.)
This graph shows the data for the six countries in a different way. I think it’s a bit less clear in some respects, but it’s useful for comparing the countries at particular times, and how their trajectories vary.
Fertility rates are also widely available from 1960. Women everywhere are having fewer babies than they were 60 years ago, but the most rapid falls have been in middle income countries. Again, the six countries show more variation.
The World Bank’s earliest education data starts a bit later than the demographic indicators, from 1970 onwards. The most notable feature of this is the convergence across every income group at primary school level, and secondary education is not far behind except in the poorest countries. (Watch out for the difference in the scale of the y axis, and the perspective, of the two charts…)
There are a few more gaps appearing in the country data, especially in the secondary school data. Gaps can be frustrating of course, but they’re important for highlighting questions I’d need to be asking about how the data was collected, and about the calculation of aggregates.
Mostly, employment data in the WDI doesn’t get going until the 1990s. But the education data does contain some information about the gender of primary school teachers. (One of the oddest gaps in the country data is that this isn’t recorded for Norway. The data for secondary school teachers is even bittier and I decided not to include it.)
I already knew that in Britain primary school teaching is a heavily female profession, but I didn’t quite realise how far this extends across high income countries generally. The global trend seems to be in the same direction, but much more slowly in the low income countries group.