Python Folder
Python script files.
After we create clean digital text for each page of the source text, we still have to make several critical changes before the pages can be used as digital edition masters. All of these changes are made with Python scripts.
- ABBYY FineReader outputs each individual page as a separate file. We need this format in order to record the original page numbers in the digital edition, but we do not want the edition to consist of individual pages. For the Encyclopedia, the entry is the basic unit of meaning, not the page. We use Python to concatenate all of the page files into a single file, with pages correctly numbered. Python then splits that file every time it encounters a new entry title. In the end, we have a single file for each entry, with page breaks and original page numbers clearly marked. We call these new files entry files.
- Footnotes in the print edition appear at the bottom of the page. Since a digital edition is not print-based and does not have separate pages, this format does not make sense. Instead, we follow the guidelines of the TEI and move the note text into the body text at the point of attachment (i.e., where the original footnote reference appeared). Note text is surrounded by the `<note>` tag and can be displayed in any form needed in the digital edition.
- Early editions use the long-s (`ſ`), and so the word `possible` is printed as `poſſible`. OCR programs often reproduce this as poffible, because `ſ` looks like `f`. Correcting these errors is complex; the English language has many word pairs like `few` and `sew`, where the context is the only way to know for sure whether the original was `f` or `ſ`. We use a Python script to resolve most of these difficulties by combining a dictionary with rules about context, which automates most of the corrections.
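
The page-to-entry step can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the function name `pages_to_entries`, the `ENTRY_TITLE` pattern (a line in capitals such as `ABACUS,`), and the use of a TEI-style `<pb n="..."/>` marker for page breaks are all assumptions made for the example.

```python
import re

# Assumption: an entry title is a line consisting only of capital letters,
# spaces, commas, apostrophes, and hyphens (e.g. "ABACUS,").
# The real titles may need a different rule.
ENTRY_TITLE = re.compile(r"^[A-Z][A-Z ,'-]+$")


def pages_to_entries(pages):
    """Join (page_number, text) pairs and split the result into entries.

    Returns a list of (title, text) pairs. The start of each page is marked
    with a <pb n="..."/> line so the print pagination is preserved.
    Text before the first title is returned with a title of None.
    """
    # Step 1: concatenate all pages, inserting a page-break marker per page.
    lines = []
    for number, text in pages:
        lines.append(f'<pb n="{number}"/>')
        lines.extend(text.splitlines())

    # Step 2: start a new entry every time a title line is encountered.
    entries = []
    title, body = None, []
    for line in lines:
        stripped = line.strip()
        if ENTRY_TITLE.match(stripped):
            # A page break directly before a title belongs to the new entry.
            carried = []
            while body and body[-1].startswith("<pb "):
                carried.insert(0, body.pop())
            if title is not None or body:
                entries.append((title, "\n".join(body)))
            title, body = stripped, carried + [stripped]
        else:
            body.append(line)
    if title is not None or body:
        entries.append((title, "\n".join(body)))
    return entries
```

Each returned pair could then be written to its own entry file. An entry that runs across a page boundary keeps the `<pb/>` marker inside its text, which is how the original page numbers stay recoverable.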
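
Moving footnotes to their point of attachment might look something like the sketch below. It assumes, purely for illustration, that references in the body appear as `[1]`, `[2]`, and so on, and that each footnote occupies a single line at the bottom of the page beginning with the same marker. The OCR output may mark notes differently, and the function name `inline_footnotes` is hypothetical.

```python
import re

# Assumption: a footnote is one line of the form "[1] note text".
NOTE_LINE = re.compile(r"^\[(\d+)\]\s+(.*)$")
# Assumption: a reference in the body is the bare marker "[1]".
NOTE_REF = re.compile(r"\[(\d+)\]")


def inline_footnotes(page_text):
    """Replace each footnote reference with a <note> element holding its text."""
    body_lines, notes = [], {}
    for line in page_text.splitlines():
        match = NOTE_LINE.match(line)
        if match:
            # Footnote line: remember its text, keyed by its marker number.
            notes[match.group(1)] = match.group(2)
        else:
            body_lines.append(line)
    body = "\n".join(body_lines)

    def replace(match):
        text = notes.get(match.group(1))
        if text is None:
            # No matching footnote on this page: leave the reference as-is.
            return match.group(0)
        return f"<note>{text}</note>"

    return NOTE_REF.sub(replace, body)
```

Footnotes that span several lines or continue onto the next page are not handled here and would need extra logic.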
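
The dictionary-plus-context idea for the long-s can be illustrated with a small sketch. It is not the project's script. It assumes the goal is to normalize to a modern `s`, it uses a single context rule (the long-s was not used at the end of a word, so a final `f` is never changed), and it flags ambiguous or unknown words for review instead of guessing. The names `candidates`, `correct_word`, and `correct_text` are hypothetical.

```python
import re
from itertools import product

# Only all-lowercase words are examined; a capital S was never a long-s.
WORD = re.compile(r"\b[a-z]+\b")


def candidates(word):
    """Yield every spelling obtained by reading some non-final 'f' as 's'."""
    options = []
    for i, ch in enumerate(word):
        if ch == "f" and i < len(word) - 1:
            options.append(("f", "s"))  # may be a misread long-s
        else:
            options.append((ch,))  # context rule: a final 'f' stays 'f'
    for combo in product(*options):
        yield "".join(combo)


def correct_word(word, dictionary):
    """Return (spelling, certain). Certain means exactly one dictionary match."""
    matches = [c for c in candidates(word) if c in dictionary]
    if len(matches) == 1:
        return matches[0], True
    return word, False


def correct_text(text, dictionary):
    """Correct unambiguous words; return the new text and the flagged words."""
    flagged = []

    def fix(match):
        word = match.group(0)
        guess, certain = correct_word(word, dictionary)
        # Flag words that contain a possible long-s but could not be resolved.
        if not certain and "f" in word[:-1]:
            flagged.append(word)
        return guess

    return WORD.sub(fix, text), flagged
```

With a dictionary that contains both `few` and `sew`, the word `few` yields two matches and is flagged, which mirrors the case described above where only context can decide.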