Named Entity Recognition

Running named entity recognition on entries

In the outputs\metadata directory, create a batch folder for the group of files being processed, using the same alphabetical folder tree as the entry files. For example:
outputs\metadata\eb03\a\
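The folder can also be created from a script. The following is a minimal sketch, not part of the project's tooling: the function name make_batch_folder and its parameters are hypothetical, and only the outputs\metadata\eb03\a\ layout comes from the step above.

```python
from pathlib import Path


def make_batch_folder(project_root, edition, letter):
    """Create outputs/metadata/<edition>/<letter>/ under project_root.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    batch = Path(project_root) / "outputs" / "metadata" / edition / letter.lower()
    # parents=True builds any missing intermediate folders; exist_ok=True
    # makes the call safe to repeat for a batch that already exists.
    batch.mkdir(parents=True, exist_ok=True)
    return batch
```

For the example path above, the call would be make_batch_folder(".", "eb03", "a"), run from the project root.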
In Oxygen XML Editor, use tei-to-text.xsl to create TXT versions of the XML entry files. Keep the same base filename and output the TXT files to the batch folder. These TXT files are the input for generating the NER data and, later, for processing in HIVE2.
To generate the NER data, use Stanza, the Stanford NLP Group’s current natural language processing toolkit for Python, which includes Stanford NER. Run Stanza on the TXT batch using two different NER subsets:
1. General entities (NER Topics). Save results in the batch folder as CSV files, with "a" appended to the base filename.
2. Locations (NER Geo). Save results in the batch folder as CSV files, with "b" appended to the base filename.
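The two passes above might be scripted roughly as follows. This is a sketch under several assumptions that the procedure does not confirm: that "Locations" corresponds to the GPE and LOC labels of Stanza's default English NER model, that "General entities" means every other label, and that each CSV holds an entity column and a type column. The helper names are hypothetical.

```python
import csv
from pathlib import Path

# Assumption: location-type labels from Stanza's default English NER
# model (OntoNotes-style tags). The procedure does not list the labels.
GEO_TYPES = {"GPE", "LOC"}


def split_entities(entities):
    """Split (text, type) pairs into (topics, geo) lists."""
    topics = [e for e in entities if e[1] not in GEO_TYPES]
    geo = [e for e in entities if e[1] in GEO_TYPES]
    return topics, geo


def write_csv(path, rows):
    """Write entity rows to a CSV file with a header row (assumed layout)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["entity", "type"])
        writer.writerows(rows)


def run_ner_on_batch(batch_dir):
    """Write <base>a.csv (topics) and <base>b.csv (geo) for each TXT file."""
    # Imported here so the helpers above can be used without Stanza.
    # The English models must be available, e.g. via stanza.download("en").
    import stanza

    nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")
    for txt in sorted(Path(batch_dir).glob("*.txt")):
        doc = nlp(txt.read_text(encoding="utf-8"))
        entities = [(ent.text, ent.type) for ent in doc.ents]
        topics, geo = split_entities(entities)
        write_csv(txt.with_name(txt.stem + "a.csv"), topics)
        write_csv(txt.with_name(txt.stem + "b.csv"), geo)
```

Loading the pipeline once and reusing it for every file avoids reloading the models for each entry, which matters for large batches.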

Each entry will now have three files in the batch folder:


Scope	Contents	Filename pattern
each entry	full text	kp*.txt
each entry	NER Topics	kp*a.csv
each entry	NER Geo	kp*b.csv
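
Before moving on, it can help to confirm that every entry really has all three files. The check below is a minimal sketch based only on the filename patterns listed above; the function name find_incomplete_entries is hypothetical.

```python
from pathlib import Path


def find_incomplete_entries(batch_dir):
    """Return {base name: [missing filenames]} for entries lacking a CSV."""
    batch = Path(batch_dir)
    incomplete = {}
    for txt in sorted(batch.glob("kp*.txt")):
        # Each full-text file should have an "a" (topics) and "b" (geo) CSV.
        expected = [txt.stem + "a.csv", txt.stem + "b.csv"]
        missing = [name for name in expected if not (batch / name).is_file()]
        if missing:
            incomplete[txt.stem] = missing
    return incomplete
```

An empty result means the batch is complete. Any entry that is listed needs its NER pass rerun.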