Convert Pages to Entries

Use Python to convert an entire section of page files into entry files.

While experienced personnel take charge of this process, everyone can benefit from understanding how the conversion works. Python can repeat a complex series of steps to accomplish repetitive tasks quickly. To convert page files into entry files, the script performs the following operations.

  1. For all of the page files in one page section, it creates a copy and removes the TEI header, leaving just the body text.
  2. It then concatenates all of these page texts into a single file, while retaining a record of where each page starts and ends by inserting <pb> tags.
  3. The script looks for all of the footnote anchors (@@) and pairs them with the appropriate footnote text (@@@). It then moves the footnote text to the anchor point in the body text and inserts it, surrounded with the <note> tag. Finally, it removes the @ codes.
  4. It then searches for all of the entry terms, to identify the start and end of each individual entry.
  5. We need to include accurate page numbers and the name of the image file that we scanned to create the text. Python opens the page-inventory file, looks up the information, and inserts it into the <pb> element. It appends a 2-digit sequence to the end of the page number to indicate whether the entry is the first, second, or a later entry on the page. (See Page Numbers.)
  6. Finally, it writes the text of each entry to its own file, adds an empty TEI header, and saves it with a new filename indicating the edition, volume, section, page, and entry number. This is the entry file.
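
The footnote handling in step 3 can be sketched roughly as follows. This is a minimal illustration, not the project's actual script. It assumes that each footnote text sits in its own paragraph of the form <p>@@@ ...</p>, and that the first @@ anchor belongs to the first @@@ footnote, the second to the second, and so on. The function name inline_footnotes is hypothetical.

```python
import re

# Assumption: an anchor is exactly two @ signs, not part of a longer run.
ANCHOR = re.compile(r"(?<!@)@@(?!@)")
# Assumption: each footnote text is a whole paragraph that starts with @@@.
FOOTNOTE_P = re.compile(r"\s*<p>@@@\s*(.*?)</p>", re.DOTALL)


def inline_footnotes(text):
    """Move each @@@ footnote text to its @@ anchor, wrapped in <note>."""
    # Collect the footnote texts in document order.
    notes = [m.group(1).strip() for m in FOOTNOTE_P.finditer(text)]
    # Remove the footnote paragraphs from the body.
    body = FOOTNOTE_P.sub("", text)
    # Every anchor needs exactly one footnote text.
    anchors = ANCHOR.findall(body)
    if len(anchors) != len(notes):
        raise ValueError(f"{len(anchors)} anchors but {len(notes)} footnotes")
    # Replace the anchors, in order, with the matching <note> element.
    remaining = iter(notes)
    return ANCHOR.sub(lambda m: f"<note>{next(remaining)}</note>", body)


sample = (
    "<p>First claim@@ and second@@.</p>\n"
    "<p>@@@ Note one.</p>\n"
    "<p>@@@ Note two.</p>"
)
print(inline_footnotes(sample))
```

In this sketch, a mismatch between the number of anchors and the number of footnote texts is reported with an explicit message instead of being left to fail later during pairing.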

Problems

Most Python problems are caused by inconsistencies in footnote formatting. The following are the most common reasons for Python to generate an error ("index out of range").

  1. unequal numbers of @@ and @@@ codes.
  2. more than a single @@@ in the same <p>.
  3. @@@ preceded by a text character instead of a TEI element.
  4. @@ in a table when nested within a <p> inside <cell>.
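
The first three problems can be caught before the conversion is run. The following is a rough sketch of such a check, not part of the existing script. The function name check_footnote_codes is hypothetical, and the test for problem 3 assumes that "preceded by a TEI element" means the nearest non-whitespace character before @@@ is the ">" that closes a tag. Problem 4 is not covered here.

```python
import re

# Assumption: codes are runs of exactly two or exactly three @ signs.
ANCHOR = re.compile(r"(?<!@)@@(?!@)")
NOTE = re.compile(r"(?<!@)@@@(?!@)")
# Matches <p>...</p> but not <pb>, because \b requires a word boundary after "p".
PARA = re.compile(r"<p\b[^>]*>(.*?)</p>", re.DOTALL)


def check_footnote_codes(text):
    """Return a list of human-readable problems found in the footnote codes."""
    problems = []

    # Problem 1: unequal numbers of @@ and @@@ codes.
    n_anchor = len(ANCHOR.findall(text))
    n_note = len(NOTE.findall(text))
    if n_anchor != n_note:
        problems.append(f"unequal codes: {n_anchor} @@ vs {n_note} @@@")

    # Problem 2: more than a single @@@ in the same <p>.
    for number, para in enumerate(PARA.finditer(text), start=1):
        if len(NOTE.findall(para.group(1))) > 1:
            problems.append(f"paragraph {number}: more than one @@@")

    # Problem 3: @@@ preceded by a text character instead of a tag.
    for m in NOTE.finditer(text):
        before = text[:m.start()].rstrip()
        if before and not before.endswith(">"):
            problems.append(f"offset {m.start()}: @@@ preceded by text")

    return problems


print(check_footnote_codes("<p>Body@@ text.</p>\n<p>See @@@ one. @@@ two.</p>"))
```

An empty list means that none of these three problems was found. It does not guarantee that the conversion will succeed.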