From Scan to csv – a Digital Humanities Project

On 15 January, I had the opportunity to present some exciting work in progress at the EdinbR user group – here are the details of my talk:

How have birth rates changed over time, in different places, and in different social groups? We can answer these questions with large-scale data for the relatively recent past, but handwritten historical records are currently not easily accessible.

This pilot study explores the potential of digitising historical 19th century records from Austria for such statistical analyses, using a handwritten text recognition software, Transkribus (https://transkribus.eu/Transkribus/).  This work-in-progress-talk focuses on the final step in the data extraction workflow, the processing of the exported text: How can we clean messy textual data in tables, which still contains spelling errors? Which keyword spotting and classification or clustering tools could be applied? How can we extrapolate social information like for example the sex of the baby or the parents’ social status?

You can find the slides for the talk here, and the R notebook is given below and linked here.

 

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.5
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## Warning: package 'tibble' was built under R version 3.5.2
## Warning: package 'dplyr' was built under R version 3.5.2
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(readxl)

Read xlsx files and create data frames

import_notebook <- function(filename){path <- filename
sheets <- path %>%
  excel_sheets() %>% 
  set_names()
ranges <- list(cell_rows(1:8)) #For the hand-drawn tables I can select only the relevant rows (exclude the header) with this line of code, BUT for the batch processed ones the two top rows are not reliably the header, so there I need to load the full range of rows
df <- map2_df(
  sheets,
  ranges,
  ~ read_excel(path, sheet = .x, range = .y, col_names = c("Geburts und Tauftag", "Taufender", "Name Kind", "Katholisch", "Protestantisch", "Maennlich", "Weiblich", "ehelich", "unehelich", "Vater", "empty", "Mutter", "Ort", "Paten", "Hebamme", "Anmerkungen")),#define header names
  .id = "page",#add id column with page name
  filename = path #add column with the filename
) %>%
  select(-empty)

  return(df)
}

births_fully_manual <- import_notebook("20200114_TrainingSet_handCorrected_pp1-44_uniform.xlsx") %>%
  add_column(annotation_type = "fully manual")

births_tables_manual <- import_notebook("20200114_TrainingSet_KurrentM1_pp1-44_uniform.xlsx") %>%
  add_column(annotation_type = "tables manual")

births_fully_automated <- import_notebook("20191211_TrainingSet_6rowTemplate_fixed_tables.xlsx") %>%
  add_column(annotation_type = "fully automated")%>%
  mutate(page_number = as.numeric(str_extract(page, "\\d{2}")))%>%
  filter(page_number %in% (1:44))%>% #subset to only those pages that I have already hand-corrected
select(-page_number)

extract YOB before deleting header rows

births_fully_manual <- births_fully_manual%>%
  mutate(Jahr = as.numeric(str_extract(`Geburts und Tauftag`, "\\d{4}")))%>%
  drop_na(Jahr)%>%
  filter(Jahr %in% (1860:1904))%>% #define overall possible range of years
  distinct(Jahr, page)%>% 
  full_join(births_fully_manual)
## Joining, by = "page"
births_tables_manual <- births_tables_manual%>% 
  mutate(Jahr = as.numeric(str_extract(`Geburts und Tauftag`, "\\d{4}")))%>%
  drop_na(Jahr)%>%
  filter(Jahr %in% (1860:1904))%>%
  distinct(Jahr, page)%>% #we get 37 entries, BUT there are two conflicting ones on severl pages -> need to get it down to only one
  group_by(page)%>%
  slice(1L)%>% #simply take the first YOB logged (should be refined!)
  #YOB gets recognised on 31 pages 
  full_join(births_tables_manual)
## Joining, by = "page"
births_fully_automated <- births_fully_automated%>% 
  mutate(Jahr = as.numeric(str_extract(`Geburts und Tauftag`, "\\d{4}")))%>%
  drop_na(Jahr)%>%
  filter(Jahr %in% (1860:1904))%>%
  distinct(Jahr, page)%>% #we get 37 entries, BUT there are two conflicting ones on severl pages -> need to get it down to only one
  group_by(page)%>%
  slice(1L)%>% #simply take the first YOB logged (should be refined!)
  #YOB gets recognised on 31 pages 
  full_join(births_fully_automated) 
## Joining, by = "page"

join the three data frames and delete header rows

births_all_versions <- births_fully_automated %>%
  full_join(births_fully_manual)%>%
  full_join(births_tables_manual) %>%
  filter(!is.na(`Name Kind`)&!is.na(Taufender)&!is.na(Hebamme)) %>% #drop rows where no child, no priest, and no midwife are recorded
  drop_na(Mutter) %>% #drop rows with no entry for the mother
  filter(!str_detect(Hebamme, ".e.amme")) %>% #detects those rows that contain versions of "Hebamme" and excludes them (first header row)
  filter(!str_detect(Ort, "Ort"))%>% #detect first header rows
  filter(!str_detect(Taufender, ".amen"))%>%
  filter(!str_detect(Taufender, "Taufenden"))%>% # detects second header rows based on "Taufender"
  filter(!str_detect(`Name Kind`, ".et.u.ten"))%>% #detect second header row based on the column "child's name"
  filter(!str_detect(Hebamme, "Zu.am.")) #detects rows that contain "Zuname" in the Hebamme column (ie second header rows) and excludes them
## Joining, by = c("Jahr", "page", "Geburts und Tauftag", "Taufender", "Name
## Kind", "Katholisch", "Protestantisch", "Maennlich", "Weiblich", "ehelich",
## "unehelich", "Vater", "Mutter", "Ort", "Paten", "Hebamme", "Anmerkungen",
## "annotation_type")Joining, by = c("Jahr", "page", "Geburts und Tauftag",
## "Taufender", "Name Kind", "Katholisch", "Protestantisch", "Maennlich",
## "Weiblich", "ehelich", "unehelich", "Vater", "Mutter", "Ort", "Paten",
## "Hebamme", "Anmerkungen", "annotation_type")

extract interesting information

To do something with the data, we would like to extract various pieces of information:

Information that needs to be extracted from individual columns:

  • name of the child (match with list of common names)
  • father’s occupation (match with list of occupations)
  • place of birth (match with list of place names)
  • parents’ place of birth (match with list of place names)

Information that is theoretically logged specifically in its own column:

  • religion (catholic/protestant)
  • sex of the child (male/female)
  • in or out of wedlock (ehelich/unehelich)

NOTE: 1. in the automatic table matching, those six columns are recognised rather unreliably 2. even if the columns are recognised/ hand-annotated, the tick marks in them are not recognised very reliably AND the columns may contain text that is bleeding across several cells (basically the priest and parish office staff not sticking to the table structure but writing all over the place)

This means we can’t really use the export from those six columns and need to extrapolate this information from other entries instead.

  • religion: based on father’s religion (match with variants of “catholic” and “protestant”)
  • sex of the baby: based on child’s name
  • wedlock: based on the entry for the father containing the word “Zeugen” (witnesses)
listOfnames <- read.csv(file = "names_gender.csv") #read the list of common names with sex assigned
names_string <- str_c(listOfnames$Name, collapse = "|")

listofjobs <-read.csv(file = "berufsliste.csv")
jobs_string <- str_c(listofjobs$Beruf, collapse = "|")

listofplaces <- read.csv(file = "placenames.csv")
places_string <- str_c(listofplaces$placename, collapse = "|")

religion_vec <- c("kat..", "Kat..", "prot", "Prot", "k.R.", "k.th", "k.Rel")
religion_string <- str_c(religion_vec, collapse = "|")

births_all_versions <- births_all_versions %>%
  ungroup() %>%
  mutate(Name = as.factor(str_extract(`Name Kind`, names_string)),
         Religion = as.factor(str_extract(Vater, religion_string)),
         wedlock = as.factor(str_extract(Vater, "Zeugen")),
         job_father = as.factor(str_extract(Vater, jobs_string)),
         job_mother = as.factor(str_extract(Mutter, jobs_string)),
         place_birth = as.factor(str_extract(Ort, places_string)),
         origin_father = as.factor(str_extract(Vater, places_string)),
         origin_mother = as.factor(str_extract(Mutter, places_string))) 

compare results across the three types of export

nrows_per_type <- births_all_versions %>% #create df with number of rows per type
  group_by(annotation_type)%>%
  count(name = "n_rows")

summary_birth_records <- births_all_versions %>% 
    mutate(page = as.factor(page),
         annotation_type = as.factor(annotation_type))%>% #make sure everything relevant is a factor
  select(Jahr, Name, Religion, wedlock, job_father, job_father, place_birth, origin_father, origin_mother, annotation_type) %>% #select the variables we are interested in, NOTE: "annotation type" becomes our grouping variable
  left_join(listOfnames) %>%
  group_by(annotation_type) %>%
  summarise_all(funs(sum(!is.na(.)))) %>%
  left_join(nrows_per_type)
## Joining, by = "Name"
## Warning: Column `Name` joining factors with different levels, coercing to
## character vector
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
## Joining, by = "annotation_type"
## Warning: Column `annotation_type` joining factor and character vector,
## coercing into character vector
summary_birth_records
annotation_type
<chr>
Jahr
<int>
Name
<int>
Religion
<int>
wedlock
<int>
job_father
<int>
place_birth
<int>
origin_father
<int>
origin_mother
<int>
n_rows total <int>
fully automated 66 25 94 8 68 27 97 120 155
fully manual 141 73 141 8 88 67 130 158 191
tables manual 114 24 112 10 80 31 117 149 190

visualise the key differences

summary_birth_records%>%
  select(annotation_type, Jahr, Name, wedlock, job_father, place_birth, n_rows)%>%
  mutate(annotation_type = ordered(annotation_type, levels = c("fully automated", "tables manual", "fully manual")))%>%
  gather(key = "variable", value = "n_rows_recognised",-annotation_type, -n_rows)%>%
  mutate(perc = n_rows_recognised / n_rows) %>%
  ggplot(aes(x=variable, fill = annotation_type)) + #reorder the frequency from highest to lowest for plotting
    geom_col(aes(y=perc), position = "dodge") + #create a column chart, where the height of the bars on the y-axis is determined by the variant frequency we have just calculated
   scale_y_continuous(labels=scales::percent, limits=c(0,1)) + #convert the y axis to percent; the "limits" command scales the y axis to range from o to 100% (because the axis is based on proportions, not percentages, 1 is what we need to refer to 100%)
    ylab("Percent rows with information")+ #give the y axis a name
    xlab(NULL)+ #give the x axis a name
    ggtitle("Information captured with different methods")+ #give the plot a title
    #geom_text(aes(y = perc + .06,    # nudge above top of bar
    #              label = paste0(round(perc*100, 1), '%')),    # prettify
              #nudge_y = .001, 
    #        size = 2.5)+ #this code adds percentage labels on top of the bars
  #facet_wrap(~annotation_type)
  NULL

Some plotting options

Based on the “fully manual” data, we can explore some developments in terms of births:

births over time

births_all_versions %>%
  filter(annotation_type=="fully manual")%>%
  ggplot(aes(x=Jahr))+
  geom_bar()
## Warning: Removed 50 rows containing non-finite values (stat_count).

By year and place

births_all_versions %>%
  filter(annotation_type=="fully manual")%>%
  ggplot(aes(x=Jahr, fill = place_birth))+
  geom_bar()
## Warning: Removed 50 rows containing non-finite values (stat_count).
## Warning: position_stack requires non-overlapping x intervals

By year and father’s occupation

births_all_versions %>%
  filter(annotation_type=="fully manual")%>%
  ggplot(aes(x=Jahr, fill = job_father))+
  geom_bar()
## Warning: Removed 50 rows containing non-finite values (stat_count).
## Warning: position_stack requires non-overlapping x intervals

 -> of course these would need to be coded for social status to be turned into a more meaningful picture

births_all_versions %>%
  filter(annotation_type=="fully manual")%>%
  ggplot(aes(x=Jahr, fill = wedlock))+
  geom_bar()
## Warning: Removed 50 rows containing non-finite values (stat_count).
## Warning: position_stack requires non-overlapping x intervals

Autor:

„It’s all about the Interaction!“ Qualitatively understanding and (more recently) quantitatively modelling interaction between people has always fascinated me. My PhD dissertation – It’s all about the interaction – Listener Responses as a Discourse-Organisational Variable – focuses on the phenomenon of Listener Responses, commonly called backchannels.