With hundreds of thousands of files, the Nineteenth-Century Knowledge Project needs a clear means of organizing its data. We use specific naming conventions for all files and folders to order the material.


Files are stored in seven repositories: one for the OCR files of each of the four editions, and three for supplemental and derivative files.

| name | location | contents |
| --- | --- | --- |
| archive | HDD | Image files from multiple scans of different Encyclopedia Britannica editions |
| eb03 | Google Drive | ocr-project files |
| eb07 | Google Drive | ocr-project files |
| eb09 | Google Drive | ocr-project files |
| eb11 | Google Drive | ocr-project files |
| metadata | Google Drive | Contains the metadata folder |
| outputs | GitHub | This repo contains the output from the OCR process and everything that follows in creating the TEI master files. It also includes derivatives, code, and analytics. |

Numbered folder names

We number certain folders used in the process of creating the master files for digital editions, to show their position in the Knowledge Project workflow. Their names and functions are given below:

| foldername | repository | function |
| --- | --- | --- |
| 1-afr-project | eb03, eb07, eb09, eb11 | ABBYY FineReader has a proprietary compressed folder structure for storing its OCR data. See 1-afr-project Folder. |
| 2-page-docx | outputs | We save our OCR results in Word's docx format, with text from one printed page per file. See 2-page-docx Folder. |
| 3-page-tei | outputs | Each docx file is transformed into the TEI format as a page file, with each file containing text from one printed page. See 3-page-tei Folder. |
| 4-entry-tei | outputs | The TEI page files are combined and segmented into entry files, with each file containing a single, complete entry. See Convert Pages to Entries. |
| 5-entry-md | outputs | Each entry file is processed by HIVE, which automatically generates subject headings for the file and outputs them as a csv file. These entry metadata files are stored here. |
| 6-master-tei | outputs | We use Python to import the subject terms from the 5-entry-md files into the TEI Header of the 4-entry-tei files. The result is a properly encoded TEI file with relevant subject headings for every entry in the EB. |
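
The import step described for 6-master-tei could be sketched roughly as follows. This is a minimal illustration, not the project's actual script. It assumes the HIVE csv has a column named `term`, and that subject headings belong under `teiHeader/profileDesc/textClass/keywords`. The function names `read_subject_terms` and `add_subject_terms` are hypothetical, as are the column name and the header location.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI elements without an "ns0:" prefix.
ET.register_namespace("", TEI_NS)


def read_subject_terms(csv_text):
    """Return the non-empty values of the (assumed) 'term' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    terms = []
    for row in reader:
        term = (row.get("term") or "").strip()
        if term:
            terms.append(term)
    return terms


def add_subject_terms(tei_xml, terms):
    """Add each term under teiHeader/profileDesc/textClass/keywords.

    Missing intermediate elements are appended to the end of their parent;
    a real header may need them placed in a specific order.
    """
    root = ET.fromstring(tei_xml)
    header = root.find(f"{{{TEI_NS}}}teiHeader")
    if header is None:
        raise ValueError("entry file has no teiHeader")
    parent = header
    for tag in ("profileDesc", "textClass", "keywords"):
        child = parent.find(f"{{{TEI_NS}}}{tag}")
        if child is None:
            child = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}")
        parent = child
    for term in terms:
        ET.SubElement(parent, f"{{{TEI_NS}}}term").text = term
    return ET.tostring(root, encoding="unicode")


# Hypothetical stand-ins for one 5-entry-md csv and one 4-entry-tei file.
sample_csv = "term\nAstronomy\nTelescopes\n"
sample_tei = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    "<teiHeader><fileDesc/></teiHeader>"
    "<text><body><p>Entry text.</p></body></text>"
    "</TEI>"
)
master_tei = add_subject_terms(sample_tei, read_subject_terms(sample_csv))
print(master_tei)
```

In practice the two inputs would be read from matching files in 5-entry-md and 4-entry-tei, and the result written to 6-master-tei. How the files are paired depends on the project's file naming conventions.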