Reading 35,000 Books: UCD CS faculty collaborates with UCD School of English to help humanities scholars better explore our cultural past
Dr. Derek Greene is an Assistant Professor in the UCD School of Computer Science and a Funded Investigator at the SFI Insight and VistaMilk research centres. Prof. Gerardine Meaney is Professor of Cultural Theory in the UCD School of English, Drama and Film.
In recent years, the potential for collaboration between data science and other disciplines to develop new research methods has been increasingly recognised. This is particularly evident in the development of cultural analytics in the field of Digital Humanities, where available datasets and other digital resources for humanities research have expanded rapidly in the last decade. Since 2011 there has been an active collaboration between the School of Computer Science and the School of English. Using advanced Data Science techniques, we have approached literary sources with new questions and come up with some surprising findings.
Initially, our work focused on developing network analysis models to represent the associations between characters in 19th and early 20th century Irish and British fiction (http://www.nggprojectucd.ie). In this case, the data analysed was a corpus of annotated full-texts for 46 novels, with over 9,000 named entities, a very large data set for literary studies. For a novel like Pride and Prejudice, we were able to identify ways in which seemingly minor characters were in fact crucial in bringing about major plot developments. Understanding this, we can now revisit these characters and increase our knowledge of the society in which Jane Austen wrote.
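To give a flavour of what a character network looks like in practice, here is a minimal sketch in Python. It assumes that two characters are "associated" whenever they are named in the same text segment (for example a paragraph or chapter), which is one common simplification and not necessarily the definition used in the project. The function names and the placeholder segments are hypothetical and exist only for illustration.

```python
from collections import Counter
from itertools import combinations


def build_network(segments):
    """Count how often each pair of characters is named in the same segment.

    `segments` is a list of lists of character names (hypothetical input
    format; real annotated texts would need to be parsed into this shape).
    """
    edges = Counter()
    for names in segments:
        # Deduplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(names)), 2):
            edges[(a, b)] += 1
    return edges


def weighted_degree(edges):
    """Sum the edge weights attached to each character."""
    degree = Counter()
    for (a, b), weight in edges.items():
        degree[a] += weight
        degree[b] += weight
    return degree


# Placeholder data, not drawn from any real novel.
SEGMENTS = [
    ["Character A", "Character B"],
    ["Character A", "Character C"],
    ["Character B", "Character C", "Character D"],
    ["Character C", "Character D"],
]

edges = build_network(SEGMENTS)
degree = weighted_degree(edges)
```

A character with a high weighted degree is connected to many others, so a measure of this kind is one simple way to check whether a seemingly minor character sits at the centre of the network.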
More recently, as part of the IRC-funded Contagion project (http://www.contagion.ie/), our focus has shifted to analysing historical trends at a larger scale. Through a collaboration with the British Library Labs (https://www.bl.uk/projects/british-library-labs), we have access to a much larger corpus from the British Library, covering 35,918 English language fiction and non-fiction books dating from 1700 to 1899. This is equivalent to over 12 million individual pages of printed text. For this project, we wanted to explore historical understandings of disease, contagion and migration, in order to better understand current public health challenges.
A project like this, where a huge corpus is available, presents significant challenges to humanities researchers who are studying a very specific theme. To understand how society has viewed these themes, the more discussions of them we can analyse, the better. This means looking at texts where they are mentioned only briefly, fiction books where they are used as plot devices, and non-fiction texts where they are the central theme. Analytics can also be used to establish how these themes are discussed.
The books were originally digitised to image format and then converted to plain text via optical character recognition (OCR). As a result, the quality and formatting of the text varies considerably, particularly in the case of older books. The British Library also provided metadata for the corpus, including information such as author, edition and place of publication for each book.
In February 2019, we organised a workshop at the British Library in London to showcase the Curatr platform, a web-based interface which we developed to make the British Library corpus more accessible and useful to a wider group of researchers. The platform indexes all of this text and the associated metadata, allowing the corpus to be browsed, searched, and filtered by author, title, and year. The interface also incorporates a digitised version of the topical classification index of volumes used by the British Library from 1823-1985, which allows the texts to be further filtered by categories such as “fiction”, “drama”, and “geography”.
Curatr incorporates functionality to build word lexicons. These are lists of thematically-related keywords, which are used to locate niche research topics within little-known or long, unwieldy texts. To reduce the manual effort required to build new word lexicons, we provide users with automatic keyword recommendations, as generated by word embeddings. Word embeddings refer to a set of machine learning techniques, based on neural networks, which “map” the words in a corpus vocabulary to a numeric representation. In this new representation, words which frequently appear in similar contexts in the original corpus will appear to be similar to one another, while words which rarely share contexts will be dissimilar. So for example, for the input word “influenza”, we could automatically recommend similar words such as “pneumonia” and “bronchitis”. In this way, researchers can quickly build lexicons of related words.
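The keyword recommendation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-dimensional vectors are hand-picked toy values, not output from a trained model, and the function names are hypothetical and not part of Curatr. In a real system the vectors would come from an embedding model trained on the corpus, and similarity would typically be measured with cosine similarity, as here.

```python
import math

# Hypothetical toy embeddings, hand-picked for illustration only.
# Real embeddings are learned from the corpus and have far more dimensions.
TOY_EMBEDDINGS = {
    "influenza": [0.9, 0.8, 0.1],
    "pneumonia": [0.85, 0.75, 0.2],
    "bronchitis": [0.8, 0.9, 0.15],
    "harvest": [0.1, 0.2, 0.9],
    "railway": [0.2, 0.1, 0.8],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recommend(seed, embeddings, top_n=2):
    """Return the top_n words most similar to the seed word."""
    seed_vec = embeddings[seed]
    scored = [
        (word, cosine(seed_vec, vec))
        for word, vec in embeddings.items()
        if word != seed
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:top_n]]


suggestions = recommend("influenza", TOY_EMBEDDINGS)
```

With these toy vectors, the two disease terms rank above the unrelated words, which mirrors how a researcher would be offered candidate additions to a lexicon and then accept or reject each one.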
Researchers working with the British Library corpus have previously attempted to curate smaller sub-corpora related to specific topics or interests. This has often been a painstaking task requiring considerable manual effort to inspect the corpus. To make this process easier, Curatr also supports the creation of sub-corpora, defined thematically, chronologically, and by classification. Once a curated sub-corpus of texts has been identified, the associated texts and metadata can be easily exported to other platforms for further research and for more traditional “close reading”.
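As a rough illustration of how a sub-corpus might be defined thematically, chronologically, and by classification, the following Python sketch filters a small list of records and exports the result as JSON. The `Book` record, the function names, the sample records, and the JSON export format are all assumptions made for this example; they do not describe Curatr's actual data model or export options.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Book:
    """Hypothetical record combining metadata and OCR text."""
    title: str
    author: str
    year: int
    category: str
    text: str


def build_subcorpus(books, lexicon, start_year, end_year, category=None):
    """Keep books in the year range (and category, if given) that
    contain at least one lexicon term."""
    terms = {term.lower() for term in lexicon}
    selected = []
    for book in books:
        if not (start_year <= book.year <= end_year):
            continue
        if category is not None and book.category != category:
            continue
        # Naive whitespace tokenisation; real OCR text needs more cleaning.
        words = set(book.text.lower().split())
        if terms & words:
            selected.append(book)
    return selected


def export_subcorpus(books):
    """Serialise the selected books and their metadata as JSON."""
    return json.dumps([asdict(book) for book in books], indent=2)


# Placeholder records, invented for illustration.
SAMPLE_BOOKS = [
    Book("Hypothetical Novel A", "Author One", 1848, "fiction",
         "the influenza spread through the village"),
    Book("Hypothetical Travelogue B", "Author Two", 1860, "geography",
         "the river runs north of the town"),
    Book("Hypothetical Novel C", "Author Three", 1795, "fiction",
         "a fever and pneumonia took the child"),
]

lexicon = ["influenza", "pneumonia"]
nineteenth_century_fiction = build_subcorpus(
    SAMPLE_BOOKS, lexicon, 1800, 1899, category="fiction"
)
exported = export_subcorpus(nineteenth_century_fiction)
```

Combining a lexicon with date and classification filters in this way narrows a very large collection to a set small enough for close reading.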
In the next phase of our work, our plan is to make the Curatr tool publicly available for wider research use, and potentially to extend the platform to include other humanities text corpora and different types of content, such as texts from historical newspaper archives.
The Contagion project is funded through the Irish Research Council and the Insight Centre for Data Analytics.