Introduction

What is the Nineteenth-Century Knowledge Project?

Objective

The Nineteenth-Century Knowledge Project uses historic editions of the Encyclopedia Britannica (EB) to build one of the most extensive, open, digital collections available today for studying the structure of nineteenth-century knowledge and its transformation. The effort is headed by Professor Peter M. Logan of Temple University and supported by the Digital Scholarship Center of Temple University Libraries. We work closely with the Metadata Research Center at Drexel University to develop a comprehensive metadata scheme for all entries.

The different editions of the EB were the most comprehensive representation extant of what constituted official knowledge throughout the nineteenth century. Those editions also demonstrate changes over time in the nature of knowledge in the English-speaking world. These works are already available on the web, but the existing textual data derived from the image files is too inaccurate to be used for text mining. The Knowledge Project is creating the first accurate, standards-compliant textual dataset for this corpus.

We extend the collection's usability by applying innovative methods to automatically generate metadata for each of the approximately 100,000 entries, which will be tagged with both current and historical subject categories. At the end of the project, all of the data will be made freely available, and a series of experiments will be conducted to assess the feasibility of tracking concept drift across time within the corpus.

Production Method

The project begins by running optical character recognition (OCR) on existing image files of Encyclopedia pages. The text is first output as single pages in DOCX format, then converted into XML. The data is encoded according to the standards developed by the Text Encoding Initiative (TEI). Python is used to combine the TEI pages into a single file and segment it into separate files for each entry in the EB. We then add topical metadata to each entry.
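The combine-and-segment step can be illustrated with a minimal Python sketch. It is not the project's actual code. It assumes, purely for illustration, that each page is a TEI document whose body holds paragraph elements, and that a new entry begins at any paragraph whose first child is a label element containing the headword. The function names combine_pages and segment_entries, the entry-boundary rule, and the sample text are all hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI P5 namespace
ET.register_namespace("", TEI_NS)


def combine_pages(page_xml_strings):
    """Merge the <body> children of each TEI page into one <body> element."""
    combined = ET.Element(f"{{{TEI_NS}}}body")
    for page in page_xml_strings:
        root = ET.fromstring(page)
        body = root.find(f".//{{{TEI_NS}}}body")
        if body is not None:
            combined.extend(list(body))
    return combined


def segment_entries(body):
    """Split a combined body into (headword, <div type="entry">) pairs.

    Assumption (hypothetical markup rule): an entry starts at a <p>
    whose first child element is a <label> holding the headword.
    Paragraphs before the first headword are skipped.
    """
    entries = []
    current = None
    for elem in body:
        first = elem[0] if len(elem) else None
        starts_entry = (
            elem.tag == f"{{{TEI_NS}}}p"
            and first is not None
            and first.tag == f"{{{TEI_NS}}}label"
        )
        if starts_entry:
            current = ET.Element(f"{{{TEI_NS}}}div", {"type": "entry"})
            entries.append(((first.text or "").strip(), current))
        if current is not None:
            current.append(elem)
    return entries


# Invented sample pages; the second entry runs across the page break.
PAGE_1 = (
    f'<TEI xmlns="{TEI_NS}"><text><body>'
    "<p><label>ABACUS</label>, a counting frame.</p>"
    "<p><label>ABBEY</label>, a monastery</p>"
    "</body></text></TEI>"
)
PAGE_2 = (
    f'<TEI xmlns="{TEI_NS}"><text><body>'
    "<p>governed by an abbot.</p>"
    "</body></text></TEI>"
)

entries = segment_entries(combine_pages([PAGE_1, PAGE_2]))
for headword, div in entries:
    print(headword, len(div))
```

Combining pages before segmenting matters because an entry may continue over a page break, so entry boundaries cannot be found reliably one page at a time. Each resulting div could then be written to its own file with ET.ElementTree(div).write(...).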

These master files for the new digital edition can then be used to generate editions in a variety of formats, from XHTML to EPUB to plain text. Finally, we run textual analysis routines on the data.
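As one illustration of deriving another format from a master file, the Python sketch below produces a plain-text version of a single TEI entry. It is an assumption-laden example rather than the project's own routine: the function name entry_to_text and the sample entry are hypothetical, and it assumes the entry's prose sits in TEI paragraph elements.

```python
import xml.etree.ElementTree as ET

TEI_P = "{http://www.tei-c.org/ns/1.0}p"  # TEI paragraph element


def entry_to_text(entry_xml: str) -> str:
    """Return the entry's paragraphs as plain text, separated by blank lines."""
    root = ET.fromstring(entry_xml)
    paragraphs = []
    for p in root.iter(TEI_P):
        # Drop inline markup, then collapse line breaks and repeated spaces.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return "\n\n".join(paragraphs)


# Invented sample entry for demonstration only.
SAMPLE_ENTRY = (
    '<div xmlns="http://www.tei-c.org/ns/1.0" type="entry">'
    "<p><label>ABACUS</label>, a counting\n   frame.</p>"
    "<p>Second paragraph.</p>"
    "</div>"
)

print(entry_to_text(SAMPLE_ENTRY))
```

Stripping markup this way discards structural information, so output of this kind suits text mining better than display; XHTML or EPUB editions would more likely come from a transformation that maps TEI elements to their counterparts.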

Accessibility

When complete, the data set will include all text for approximately 100,000 entries. This material will be made freely available online through two public repositories, both as user-friendly individual entries and as a bulk download for researchers.

Contact Information

If you want to learn more about the Knowledge Project, are interested in contributing to it, or would like to ask about using the data set, please contact the project lead, Prof. Peter M. Logan, at peter.logan@temple.edu.

Acknowledgments

Created in 1965 as an independent federal agency, the National Endowment for the Humanities supports research and learning in history, literature, philosophy, and other areas of the humanities by funding selected, peer-reviewed proposals from around the nation. Additional information about the National Endowment for the Humanities and its grant programs is available at: www.neh.gov.