Production
Descriptions of OCR processing, TEI transformation, and metadata creation
- Organization
  How to keep hundreds of thousands of files organized.
  - Edition-Section System
    File organization depends on two basic folder types
  - Folder names
    As the OCR workflow passes through its various stages, production moves into specific folders for each stage. Their names and contents are given below:
  - Repositories
    A guide to the different repositories used to store ocr-project data.
    - Google Drive
      Used for image files.
    - GitHub
      Used for data files.
      - outputs Repository
        Explains the content of the outputs repository.
        - autoindex Folder
          Collects all materials needed for indexing entry files.
        - code Folder
          The central repository for program code and transformation scripts.
          - Python Folder
            Python script files.
        - digital-editions Folder
          Storage area for editions generated from the master files.
        - entry Folder
          Contains the TEI-encoded data after the page files are converted to entry files.
        - master Folder
          Contains the master files for creating digital editions.
        - metadata Folder
          A collection of files containing metadata for each entry page.
        - page Folder
          Pages are individual printed pages in the Encyclopedia.
        - records Folder
          A collection of spreadsheets and other documents recording details of the production and analytical work.
    - archive Repository
      Long-term storage of image files
  - Setting Up the Repositories
    Create local copies of the remote repositories
- Page Files
  Explains the procedures we use to get the best quality OCR of each page.
- Entry Files
  Procedures for converting single pages into Encyclopedia entries.
- Master Files
  After creating entry files, we run an automated metadata generation process to add index terms to each entry.

Python Folder

Python script files.

After we create clean digital text for each page of the source text, we still have several critical changes to make before the page files can be used as digital edition masters. Each of these changes is made with a Python script.

ABBYY FineReader outputs each individual page as a separate file. We need this format in order to record the original page numbers in the digital edition, but we do not want the edition to consist of individual pages. For the Encyclopedia, the entry is the basic unit of meaning, not the page. We use Python to concatenate all of the page files into a single file, with pages correctly numbered. Python then splits that file every time it encounters a new entry title. In the end, we have a single file for each entry, with page breaks and original page numbers clearly marked. We call these new files entry files.
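
The concatenate-then-split step can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function names, the rule that an entry title is a line written entirely in capitals, and the use of a TEI-style `<pb n="..."/>` marker for page breaks are all assumptions made for the example.

```python
import re

# Assumption: an entry title is a line made up only of capital letters,
# spaces, commas, and hyphens (e.g. "ABACUS"). The real rule may differ.
TITLE_RE = re.compile(r"^[A-Z][A-Z ,\-]+$")


def concatenate_pages(pages):
    """Join (page_number, text) pairs into one string.

    A page-break marker carrying the original page number is placed
    before the text of each page.
    """
    parts = []
    for number, text in pages:
        parts.append(f'<pb n="{number}"/>')
        parts.append(text.strip("\n"))
    return "\n".join(parts)


def split_entries(combined):
    """Split the combined text into (title, text) pairs, one per entry."""
    entries = []
    current = []   # lines collected for the entry being built
    title = None
    for line in combined.split("\n"):
        if TITLE_RE.match(line.strip()):
            # A page break directly before a title belongs to the new entry.
            carried = []
            if current and current[-1].startswith("<pb "):
                carried = [current.pop()]
            if title is not None:
                entries.append((title, "\n".join(current)))
            # Any other text seen before the first title is discarded here.
            title = line.strip()
            current = carried + [line]
        else:
            current.append(line)
    if title is not None:
        entries.append((title, "\n".join(current)))
    return entries


pages = [
    (1, "ABACUS\nA counting frame.\nIt continues"),
    (2, "onto the next page.\nABBEY\nA monastery."),
]
entries = split_entries(concatenate_pages(pages))
```

In this sketch the entry for ABACUS keeps the `<pb n="2"/>` marker in the middle of its text, so the original page boundary survives even though the entry is now a single unit.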
Footnotes in the print edition appear at the bottom of the page. Since a digital edition is not print-based and does not have separate pages, this placement does not make sense. Instead, we follow the guidelines of the TEI and move the note text into the body text at the point of attachment (i.e., where the original footnote reference appeared). Note text is enclosed in a <note> element and can be displayed in any form needed in the digital edition.
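
As a rough illustration of moving a note to its point of attachment, the sketch below assumes a hypothetical plain-text convention that is not taken from the project: references appear in the body as `[1]`, and each footnote sits on its own line at the end of the page, starting with the same marker. It also assumes the text is already safe to embed in XML and that no body line begins with a marker.

```python
import re

# Hypothetical conventions for this example only.
NOTE_LINE_RE = re.compile(r"^\[(\d+)\]\s+(.*)$")  # "[1] note text" on its own line
REF_RE = re.compile(r"\[(\d+)\]")                 # "[1]" inside the body text


def inline_footnotes(page_text):
    """Replace each footnote reference with a <note> holding the note text."""
    body_lines = []
    notes = {}
    for line in page_text.split("\n"):
        match = NOTE_LINE_RE.match(line)
        if match:
            notes[match.group(1)] = match.group(2).strip()
        else:
            body_lines.append(line)
    body = "\n".join(body_lines)

    def replace(match):
        key = match.group(1)
        if key not in notes:
            # Leave references without a matching note untouched.
            return match.group(0)
        return f'<note n="{key}">{notes[key]}</note>'

    return REF_RE.sub(replace, body)


page = "The abacus[1] is ancient.\nIt is still used.\n[1] From the Greek abax."
converted = inline_footnotes(page)
```

References with no matching note are deliberately left in place so that they can be found and checked by hand rather than silently dropped.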
Early editions use the long s (ſ), and so the word possible is printed as poſſible. OCR programs often reproduce this as poffible, because ſ looks like f. Correcting these errors is complex: English has many word pairs like few and sew, where context is the only way to know for sure whether the original was f or ſ. We use a Python script that combines a dictionary with rules about context to automate most of the corrections.
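
The dictionary part of this approach can be sketched as below. This is only an illustration under stated assumptions: the word list is a tiny stand-in for a real dictionary, the function names are hypothetical, and the only positional rule applied is that a lowercase f at the end of a word is kept, since the long s was not normally printed word-finally. Genuinely ambiguous words such as few/sew are not resolved here; they are flagged so that context rules or a person can decide.

```python
import re
from itertools import product

# Stand-in word list for the example; a real run would load a full dictionary.
WORDS = {"possible", "few", "sew", "of", "self", "is"}


def candidates(word):
    """Return every spelling made by reading a non-final 'f' as 's'.

    Only lowercase 'f' is considered, because the long s has no capital
    form, and a word-final 'f' is always kept as a real f.
    """
    options = []
    for i, ch in enumerate(word):
        if ch == "f" and i < len(word) - 1:
            options.append("fs")  # this position could be f or s
        else:
            options.append(ch)
    return {"".join(chars) for chars in product(*options)}


def correct_word(word):
    """Return (word_to_use, needs_review)."""
    matches = {c for c in candidates(word) if c.lower() in WORDS}
    if len(matches) == 1:
        return matches.pop(), False
    # No dictionary match: leave unchanged. Several matches: flag for review.
    return word, len(matches) > 1


def correct_text(text):
    """Correct every word in text and list the ambiguous ones."""
    flagged = []

    def replace(match):
        fixed, review = correct_word(match.group(0))
        if review:
            flagged.append(match.group(0))
        return fixed

    return re.sub(r"[A-Za-z]+", replace, text), flagged


fixed_text, to_review = correct_text("It is poffible; a few of us")
```

Here poffible has exactly one dictionary reading and is corrected, while few is left as it stands and returned in the review list because sew is also a valid word.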