entry Folder

Contains the TEI-encoded data after the page files are converted to entry files.

Entry files contain one entry per file. The entry is the basic unit of meaning in the Encyclopedia, and it may range in size from a single sentence to a book-length exposition.

We create entry files by running a Python script on a section of the TEI page files. The script combines all page files in that section and then segments the data at the entry terms, creating a new file for each entry. We preserve the original page numbers as well as the references to the source image for each page.

Entry files use different file and folder naming conventions than page files. The revised folder name includes the print volume number. The revised filename creates a unique identifier for each entry and more precisely indicates the entry location in the print source.

Entry folder names

Entry folders have two different naming conventions, depending on whether the files are still in process or processing is completed.

IN PROCESS

letter + volume + batch

  • "letter" is the section of the alphabet for the entry.
  • "volume" is the volume of the print edition.
  • "batch" is a 2-digit sequence for the subset of entries for the letter.

In the figure below, a0105 includes "A" entries from volume one and is the fifth batch of "A" entries. a0206 is the next batch of "A" entries, which are located in volume two.

Figure 1. In-process entry folder names
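
The in-process naming convention can be sketched in Python as follows. This is a minimal illustration, not the project's own script: the function name parse_in_process_folder is hypothetical, and the pattern assumes a single lowercase letter followed by a 2-digit volume and a 2-digit batch, as in the a0105 example.

```python
import re

# Assumed pattern, inferred from the examples a0105 and a0206:
# one lowercase letter, a 2-digit volume, and a 2-digit batch.
IN_PROCESS_FOLDER = re.compile(r"([a-z])(\d{2})(\d{2})")


def parse_in_process_folder(name):
    """Split an in-process folder name into letter, volume, and batch.

    Hypothetical helper for illustration only.
    """
    match = IN_PROCESS_FOLDER.fullmatch(name)
    if match is None:
        raise ValueError(f"not an in-process folder name: {name!r}")
    letter, volume, batch = match.groups()
    return {"letter": letter, "volume": int(volume), "batch": int(batch)}
```

For example, parse_in_process_folder("a0105") gives letter "a", volume 1, and batch 5, matching the description of a0105 above.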

AFTER PROCESSING

Once processing is complete, we combine all entries into alphabetical folders based on their entry terms.

Figure 2. Completed entry folder names

Entry file names

kp + print-edition & volume + image-sequence + page-number + position-on-the-page.
  • The image-sequence is a 4-digit number taken from the filename of the page scan image.
  • The position-on-the-page is a 2-digit number indicating the order in which the entry appears on that page (first, second, third, and so on).
In the figure below, kp-eb1128-0244-0223-03.xml is the 3rd entry on page 223 of vol. 28 in the 11th edition. Its first page was scanned from an image whose filename ends in 0244.
Note:
The image-sequence number ensures that each filename is unique. Print editions sometimes repeat a page number to insert additional material, so page numbers alone cannot guarantee uniqueness.
Figure 3. The entry folder
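
The entry filename convention can likewise be sketched in Python. This is only an illustration under stated assumptions, not the project's code: the names EntryName and parse_entry_filename are hypothetical, and the pattern assumes the layout seen in kp-eb1128-0244-0223-03.xml, that is, an "eb" prefix with a 2-digit edition and 2-digit volume, a 4-digit image sequence, a 4-digit page number, and a 2-digit position.

```python
import re
from typing import NamedTuple

# Assumed layout, inferred from the single example filename:
# kp-eb<edition:2><volume:2>-<image:4>-<page:4>-<position:2>.xml
ENTRY_FILENAME = re.compile(r"kp-eb(\d{2})(\d{2})-(\d{4})-(\d{4})-(\d{2})\.xml")


class EntryName(NamedTuple):
    """Parts of an entry filename (hypothetical structure)."""
    edition: int
    volume: int
    image_sequence: int
    page: int
    position: int


def parse_entry_filename(filename):
    """Split an entry filename into its numeric parts."""
    match = ENTRY_FILENAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"not an entry filename: {filename!r}")
    return EntryName(*(int(part) for part in match.groups()))
```

Applied to kp-eb1128-0244-0223-03.xml, this returns edition 11, volume 28, image sequence 244, page 223, and position 3. Because the image sequence is part of the name, two entries that share a repeated page number still parse to different values.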