4-entry-tei Folder

Contains the TEI-encoded data after the page files are converted to entry files.

Entry files contain one entry per file. The entry is the basic unit of meaning in the Encyclopedia, and it may range in size from a single sentence to a book-length exposition.

We create entry files by running a Python script on a section of TEI page files. The script combines all page files and then segments the data at the entry terms to create a new file for each entry. We preserve the original page numbers as well as references to the source image for the page.
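The combine-and-segment step can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the `ENTRY_MARKER` string, the `Entry` class, the `segment_entries` function, and the in-memory page format are all hypothetical, and the real TEI page files may flag entry terms differently.

```python
from dataclasses import dataclass, field

# Hypothetical marker for an entry term; the real page files may use other markup.
ENTRY_MARKER = '<label type="entry">'


@dataclass
class Entry:
    image_seq: str    # image sequence of the page on which the entry starts
    page_number: str  # printed page number on which the entry starts
    position: int     # 1 for the first entry starting on that page, 2 for the next, ...
    lines: list = field(default_factory=list)


def segment_entries(pages):
    """Split ordered (image_seq, page_number, lines) pages into entries.

    Pages are read in sequence as one continuous stream, so an entry that
    runs over a page break keeps collecting lines from the following page.
    Lines that precede the first entry term are dropped in this sketch.
    """
    entries = []
    current = None
    for image_seq, page_number, lines in pages:
        position = 0  # assumption: only entries that start on this page are counted
        for line in lines:
            if ENTRY_MARKER in line:
                position += 1
                current = Entry(image_seq, page_number, position)
                entries.append(current)
            if current is not None:
                current.lines.append(line)
    return entries
```

Each resulting `Entry` records only the page and image on which it starts; a fuller version would also carry the page-break markers inside the entry text so that later page numbers and image references are preserved.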

Entry files use a different naming convention from page files, and we retain the new pattern from this point forward in the processing workflow. The revised filename gives each entry a unique identifier and indicates more precisely where the entry appears in the print source. It includes the following elements, separated by hyphens:

File-naming scheme

kp - print-edition & volume - image-sequence - page-number - position-on-the-page.
  • The image-sequence is a 4-digit number taken from the filename of the page scan image.
  • The position-on-the-page is a 2-digit number indicating the order in which the entry appears on that page (01 for the first entry, 02 for the second, and so on).
In the figure below, kp-eb1128-0244-0223-03.xml is the 3rd entry on page 223 of vol. 28 in the 11th edition. The scan image of its first page has a filename ending in 0244.
Note: Including the image-sequence number ensures that each filename is unique. Print editions sometimes repeat a page number to insert additional material, so page numbers alone cannot guarantee uniqueness.
Figure 1. The 4-entry-tei folder
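As an illustration of the file-naming scheme, the sketch below splits an entry filename into its parts. The names `FILENAME_RE`, `EntryId`, and `parse_entry_filename` are hypothetical and not part of the project's scripts. The pattern also assumes a 4-digit page number and a lowercase alphanumeric edition-and-volume code, which is inferred only from the example filename above.

```python
import re
from typing import NamedTuple

# Pattern inferred from the example kp-eb1128-0244-0223-03.xml.
# Assumption: the page number is always 4 digits, as in the example.
FILENAME_RE = re.compile(
    r"^kp-(?P<edition_volume>[a-z0-9]+)"
    r"-(?P<image_seq>\d{4})"
    r"-(?P<page>\d{4})"
    r"-(?P<position>\d{2})\.xml$"
)


class EntryId(NamedTuple):
    edition_volume: str  # e.g. "eb1128": 11th edition, vol. 28
    image_sequence: int  # taken from the page scan image filename
    page_number: int     # page number in the print source
    position: int        # order of the entry on that page


def parse_entry_filename(name: str) -> EntryId:
    """Return the components of an entry filename, or raise ValueError."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"not an entry filename: {name!r}")
    return EntryId(
        match["edition_volume"],
        int(match["image_seq"]),
        int(match["page"]),
        int(match["position"]),
    )
```

Calling `parse_entry_filename("kp-eb1128-0244-0223-03.xml")` yields the edition-and-volume code `eb1128`, image sequence 244, page 223, and position 3. A filename that does not follow the scheme raises `ValueError`, which makes the function usable as a quick check on a folder of entry files.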