Automated Metadata Procedure

How we create subject headings for every entry file.

We rely on several Python scripts and the HIVE2 vocabulary server to automatically generate subject terms for each entry. The scripts assume that the main repository for each edition contains three parallel directories with the same internal structure:

Directory   Description
entry       TEI files, one for each encyclopedia entry.
metadata    TXT and CSV files used to generate subject terms for each entry.
master      TEI files with their subject terms written into the TEI Header.

Once we specify the edition and letter to be processed, each script traverses the tree within whichever of these three directories it needs and locates its files there.
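
The lookup across the three parallel directories could be sketched roughly as follows. This is only an illustration, not the project's actual script. The function names, the layout `<edition_root>/<directory>/<letter>/`, and the sample file names are assumptions made for the example.

```python
from pathlib import Path

# The three parallel directories described above.
PARALLEL_DIRS = ("entry", "metadata", "master")


def letter_dirs(edition_root, letter):
    """Return the folder for one letter inside each parallel directory.

    Assumes a hypothetical layout of <edition_root>/<directory>/<letter>/.
    """
    root = Path(edition_root)
    return {kind: root / kind / letter for kind in PARALLEL_DIRS}


def mirrored_path(entry_file, edition_root, target_kind, suffix=None):
    """Map a file under entry/ to the same relative spot under target_kind.

    Because the directories mirror one another, only the top-level
    directory name (and optionally the file extension) changes.
    """
    root = Path(edition_root)
    relative = Path(entry_file).relative_to(root / "entry")
    target = root / target_kind / relative
    return target.with_suffix(suffix) if suffix else target
```

For example, under these assumptions `mirrored_path("eb07/entry/a/abacus.xml", "eb07", "metadata", ".txt")` would point at `eb07/metadata/a/abacus.txt`. Keeping the trees parallel means a script needs only the edition and letter to find the matching files in all three places.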