Organization

With hundreds of thousands of files, the Nineteenth-Century Knowledge Project needs a clear means of organizing its data. We use specific naming conventions for all files and folders to order the material.

Repositories

Files are stored in three different repositories, depending on the average size of the files and how frequently they change.

| Name | Location | Contents |
| --- | --- | --- |
| archive | HDD | Stores image files from multiple scans of different Encyclopedia Britannica editions. |
| kp1 | Google Shared Drive | kp stands for Knowledge Project. This repository is number 1 because it contains the OCR files used in the first phase of production. We also store images, information, and records here. |
| kp2 | GitHub | This repo contains the output of the OCR process and everything that follows in creating the TEI master files. It also includes derivatives, metadata files, code, and analytics. |

Processing sequence

We number certain folders used in the process of creating the master files for digital editions, to show their position in the Knowledge Project workflow. Their names and functions are given below:

| Repository | Folder name | Function |
| --- | --- | --- |
| kp1 | 1-afr-project | ABBYY FineReader has a proprietary compressed folder structure for storing its OCR data. See 1-afr-project Folder. |
| kp2 | 2-page-docx | We save our OCR results in Word's docx format, with text from one printed page per file. See 2-page-docx Folder. |
| kp2 | 3-page-tei | Each docx file is transformed into the TEI format as a page file, with each file containing text from one printed page. See 3-page-tei Folder. |
| kp2 | 4-entry-tei | The TEI page files are combined and segmented into entry files, with each file containing a single, complete entry. See Convert Pages to Entries. |
| kp2 | 5-entry-md | Each entry file is processed by HIVE, which automatically generates subject headings for the file and outputs them as a CSV file. These entry metadata files are stored here. |
| kp2 | 6-master-tei | We use Python to import the subject terms from the 5-entry-md files into the TEI Header of the 4-entry-tei files. The result is a properly encoded TEI file with relevant subject headings for every entry in the EB. |
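
The final step, importing subject terms into the TEI Header, could look roughly like the sketch below. It is an illustration only, not the project's actual script. It assumes the HIVE CSV has a column named `subject` (a hypothetical name) and that the terms belong in `profileDesc/textClass/keywords` as `term` elements, which is a common TEI pattern but may not match the project's encoding. The function names are also invented for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements without a prefix (default namespace).
ET.register_namespace("", TEI_NS)


def read_subject_terms(csv_text, column="subject"):
    """Return the non-empty values of one CSV column.

    The column name "subject" is an assumption, not the documented
    layout of the 5-entry-md files.
    """
    terms = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(column) or "").strip()
        if value:
            terms.append(value)
    return terms


def _child(parent, name):
    """Find a direct TEI child element, creating it if it is missing."""
    tag = f"{{{TEI_NS}}}{name}"
    found = parent.find(tag)
    return found if found is not None else ET.SubElement(parent, tag)


def add_subject_terms(tei_xml, terms):
    """Return TEI XML with one <term> per subject heading in the header."""
    root = ET.fromstring(tei_xml)
    header = root.find(f"{{{TEI_NS}}}teiHeader")
    if header is None:
        raise ValueError("no teiHeader element found")
    # Assumed location: teiHeader/profileDesc/textClass/keywords
    profile = _child(header, "profileDesc")
    keywords = _child(_child(profile, "textClass"), "keywords")
    for term in terms:
        ET.SubElement(keywords, f"{{{TEI_NS}}}term").text = term
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Tiny stand-ins for a 4-entry-tei file and a 5-entry-md file.
    entry = (
        f'<TEI xmlns="{TEI_NS}"><teiHeader><fileDesc/></teiHeader>'
        "<text><body><p>Sample entry text.</p></body></text></TEI>"
    )
    hive_csv = "subject\nAstronomy\nMathematics\n"
    print(add_subject_terms(entry, read_subject_terms(hive_csv)))
```

In practice the two inputs would be read from matching files in 4-entry-tei and 5-entry-md, and the result written to 6-master-tei. How the files are paired by name is not covered here.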