Cleanup Page Files

Fix the most common OCR errors in the TEI-page files.

Run search and replace on batches of page files to correct common OCR errors.

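Each rule below is listed as: description, search, replace, settings. Rules flagged for manual correction or cleanup have no replace value.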
Clean up DOCX encoding
Remove all variants of @rend="CharStyleNN" and @xml:space="preserve" from <hi> and <seg>
<hi\s*rend=\"CharStyle\d{1,2}\"\s*(xml\:space=\"preserve\")?\/?>(.*?)?<\/hi>
\2
regex
Replace all <seg> elements with <hi>
<seg(\s*)(rend\=\"[a-z\s]+\")\s*>(.*?)<\/seg>
<hi\1\2>\3</hi>
regex
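As a rough illustration, the two rules above could also be applied outside the editor with Python's `re` module. This is only a sketch: the function name `clean_docx_encoding` and the sample strings are hypothetical, and the patterns are copied from the table unchanged.

```python
import re

# (search, replace) pairs transcribed from the two rules above.
# No re.IGNORECASE is passed; whether the editor ignores case for rules
# marked only "regex" is an assumption this sketch leaves open.
DOCX_RULES = [
    # Unwrap <hi> elements that carry only a Word "CharStyleNN" rendition.
    (r'<hi\s*rend=\"CharStyle\d{1,2}\"\s*(xml\:space=\"preserve\")?\/?>(.*?)?<\/hi>', r'\2'),
    # Rename <seg rend="..."> to <hi rend="...">, keeping the attribute.
    (r'<seg(\s*)(rend\=\"[a-z\s]+\")\s*>(.*?)<\/seg>', r'<hi\1\2>\3</hi>'),
]

def clean_docx_encoding(text: str) -> str:
    """Apply both DOCX cleanup rules in table order (hypothetical helper)."""
    for search, replace in DOCX_RULES:
        text = re.sub(search, replace, text)
    return text

print(clean_docx_encoding('<p><hi rend="CharStyle12" xml:space="preserve">Aard</hi>vark</p>'))
print(clean_docx_encoding('<seg rend="italic">c.</seg>'))
```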
Replace OCR errors in entry terms: each rule corrects only one character per term per pass, so rerun this set after the first run
(<p>[A-ZÀ-Ù]+)[1lſ]([A-ZÀ-Ù])
\1I\2
case sensitive; regex
(<p>[A-ZÀ-Ù]+)[0θ]([A-ZÀ-Ù])
\1O\2
case sensitive; regex
(<p>[A-ZÀ-Ù]+)Ľ([A-ZÀ-Ù])
\1E\2
case sensitive; regex
(<p>[A-ZÀ-Ù]+)Λ([A-ZÀ-Ù])
\1A\2
case sensitive; regex
(<p>[A-ZÀ-Ù]+)Γ([A-ZÀ-Ù])
case sensitive; regex; manually correct
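The rerun instruction for this set can be sketched as a loop that repeats the four automatic rules until the text stops changing. The sketch assumes Python's `re` module rather than the editor; `fix_entry_terms`, `gamma_candidates`, and `max_passes` are hypothetical names, and the Γ rule is only reported, never replaced, because the table marks it for manual correction.

```python
import re

# Automatic rules from the set above (case sensitive, as in the table).
ENTRY_TERM_RULES = [
    (r'(<p>[A-ZÀ-Ù]+)[1lſ]([A-ZÀ-Ù])', r'\1I\2'),
    (r'(<p>[A-ZÀ-Ù]+)[0θ]([A-ZÀ-Ù])', r'\1O\2'),
    (r'(<p>[A-ZÀ-Ù]+)Ľ([A-ZÀ-Ù])', r'\1E\2'),
    (r'(<p>[A-ZÀ-Ù]+)Λ([A-ZÀ-Ù])', r'\1A\2'),
]
# The Γ rule has no replacement; it is used for finding only.
GAMMA_RULE = r'(<p>[A-ZÀ-Ù]+)Γ([A-ZÀ-Ù])'

def fix_entry_terms(text, max_passes=10):
    """Rerun the whole rule set until a pass makes no change."""
    for _ in range(max_passes):
        before = text
        for search, replace in ENTRY_TERM_RULES:
            text = re.sub(search, replace, text)
        if text == before:
            break
    return text

def gamma_candidates(text):
    """Return the matched fragments that need manual correction."""
    return [m.group(0) for m in re.finditer(GAMMA_RULE, text)]

print(fix_entry_terms('<p>B1LL1ARDS, a game'))
```

A term with two misread characters, such as `<p>B1LL1ARDS`, is only fully corrected on the second pass, because each match starts at `<p>` and stops at the first bad character.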
Find hidden entries
For consonants, remove unneeded characters between <p> and the entry term (sample uses "B" for the variable letter)
(<p>)[^A-ZÀ-Ù]{1,3}(B[A-ZÀ-Ù]+)
\1\2
case sensitive; regex; check each instance
For vowels, remove unneeded characters between <p> and entry terms (sample uses "A" and its accented forms)
<p>[^A-Z]{1,3}([AÀ-ÆĀĂĄ][A-ZÀ-Ý]+)
<p>\1
case sensitive; regex; check each instance
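Because the sample patterns hard-code one letter, it may help to see how the consonant rule could be rebuilt for any initial letter. The following is a sketch in Python with hypothetical helper names (`consonant_rule`, `strip_before_entry`). It does not generalize the vowel rule, since the accented forms to include for other vowels are not given here.

```python
import re

def consonant_rule(letter):
    """Build the consonant search pattern for one initial letter."""
    return r'(<p>)[^A-ZÀ-Ù]{1,3}(' + re.escape(letter) + r'[A-ZÀ-Ù]+)'

def strip_before_entry(text, letter):
    """Drop up to three stray characters between <p> and the entry term."""
    return re.sub(consonant_rule(letter), r'\1\2', text)

print(consonant_rule('B'))
print(strip_before_entry("<p>', BACON, Roger", 'B'))
```

The table asks for each instance to be checked, so in practice the output of such a helper would be reviewed rather than trusted.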
Identify entry terms that are not on a new line (sample uses "B")
([\s\.])(B[A-ZÀ-Ù]{2,})
\1</p><p>\2
case sensitive; regex; check each instance
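For rules marked "check each instance", one possible workflow is to list every match next to its proposed replacement before changing anything. The sketch below assumes Python's `re` module; `preview` and its `context` parameter are hypothetical, and the pattern is the sample rule for "B" shown above.

```python
import re

SEARCH = r'([\s\.])(B[A-ZÀ-Ù]{2,})'
REPLACE = r'\1</p><p>\2'

def preview(text, search=SEARCH, replace=REPLACE, context=15):
    """Return (surrounding text, proposed replacement) for each match."""
    rows = []
    for m in re.finditer(search, text):
        start = max(0, m.start() - context)
        end = min(len(text), m.end() + context)
        # Match.expand fills \1, \2 in the replacement for this match only.
        rows.append((text[start:end], m.expand(replace)))
    return rows

sample = '<p>BACON, Roger. See also the article. BAROMETER, an instrument</p>'
for shown, proposed in preview(sample):
    print(repr(shown), '->', repr(proposed))
```

Nothing is modified here; the reviewer decides per row whether the capitalized word really starts a new entry.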
Find entry terms with a space or non-Roman character after the first term (sample initial letter is "B")
(<p>B)[\s\W]([A-ZÀ-Ù])
\1\2
case sensitive; regex; check each instance
Prevent terms in cells beginning with the entry letter from being mistaken for entries
(<cell><p>)(B[A-Z].*?)(\<\/p>)
\1<hi>\2</hi>\3
case sensitive; regex
Remove extra space around "A" in entry terms
(<p>[A-ZÀ-Ù]+)\sA\s?([A-ZÀ-Ù])
\1A\2
case sensitive; regex
(<p>[A-ZÀ-Ù]+)\s?A\s([A-ZÀ-Ù])
\1A\2
case sensitive; regex
Clean up footnote @@@s
Remove dirt preceding @@@ at the beginning of a line
<p>[\s\W]+(<(?:\w|\"|=|\(|\)|\s)+>?@{3})
<p>\1
regex; check each instance
Fix instances of multiple @@@s on the same line
([\.\"?:*])\s*(<(?:\w|\"|=|\(|\)|\s)+>)?@{3}
\1</p><p>\2@@@
regex; check each instance
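To illustrate the rule for multiple @@@s on one line, here is a sketch using Python's `re` module. The function name `split_footnotes` and the sample notes are hypothetical. The optional tag group may not take part in a match; Python 3.5 and later substitute an empty string for `\2` in that case.

```python
import re

SPLIT_NOTES = r'([\.\"?:*])\s*(<(?:\w|\"|=|\(|\)|\s)+>)?@{3}'

def split_footnotes(text):
    """Start a new <p> before each @@@ that follows closing punctuation."""
    # \2 is empty when no opening tag sits between the punctuation and @@@.
    return re.sub(SPLIT_NOTES, r'\1</p><p>\2@@@', text)

print(split_footnotes('<p>@@@a First note. @@@b Second note.</p>'))
```

The first @@@ is left alone because it follows `<p>` rather than one of the punctuation marks in the pattern.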
Find @@ after @@@
(@{3})([\w\s<>.,;\-—&/=\"]*)(?<!@)(@{2}[^@])
regex; manual cleanup
Remove <hi> encoding around @@@s
<hi rend\=\"\w*\">(@@@[\[mu\]]*?)(\s?)<\/hi>
\1\2
regex
For eb03, find @@@ + note siglum inside parentheses and move @@@ outside parentheses, i.e. (@@@a) ==> @@@(a)
(<p>)(<hi rend=\"smallcaps\">)?\(\s*(</hi>)?(<hi rend=\"smallcaps\">)?(@{3,})(</hi>)?(<hi rend=\"[a-z]+\">)?(.)\)?(</hi>)?\)?
\1\5(<hi rend="smallcaps">\8</hi>)
regex (single line)
Correct text errors
Replace fécond with second
fécond
second
case sensitive
Replace copyright and registered trademark symbols with o
©|®
o
case-sensitive; regex
Remove dirt after the letter w
w(\s*<hi rend=\"superscript\">\s*(,|τ|j|r|f|i|T|7)\s*</hi>)(\s*)
w
case-sensitive; regex
Fix legal citations (eb07, eb09, eb11 only)
Viet.
Vict.
case-sensitive
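Rules without the "regex" setting, such as fécond and Viet., are literal searches. If they were ever scripted, the difference matters, because the full stop in `Viet.` would act as a wildcard in a regex. A minimal sketch in Python, with a hypothetical `apply_rule` helper and made-up sample strings:

```python
import re

def apply_rule(text, search, replace, regex=False):
    """Apply one rule either as a regex or as a literal, case-sensitive search."""
    if regex:
        return re.sub(search, replace, text)
    return text.replace(search, replace)

# Literal rule: only the exact string "Viet." is changed.
print(apply_rule('5 Viet. c. 12; Vietnam', 'Viet.', 'Vict.'))
# Regex rule from the table: either symbol becomes "o".
print(apply_rule('c©al and ir®n', '©|®', 'o', regex=True))
```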
Correct roman numerals
\b([IVXCM])L\b([^\.])
\1I.\2
case-sensitive; regex; check each instance
([IVXLCM\s])[HΠ]([IVXLCM\.])
\1II\2
case-sensitive; regex; check each instance
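The two Roman numeral rules can be tried on sample strings to see both what they fix and why each instance needs checking. This sketch assumes Python's `re` module; `fix_roman_numerals` and the sample phrases are hypothetical.

```python
import re

ROMAN_RULES = [
    # A numeral ending in a misread "L" becomes "I." (e.g. VL -> VI.).
    (r'\b([IVXCM])L\b([^\.])', r'\1I.\2'),
    # "H" or "Π" next to numeral characters becomes "II".
    (r'([IVXLCM\s])[HΠ]([IVXLCM\.])', r'\1II\2'),
]

def fix_roman_numerals(text):
    """Apply both rules in table order without any review step."""
    for search, replace in ROMAN_RULES:
        text = re.sub(search, replace, text)
    return text

print(fix_roman_numerals('Henry VL of England'))
print(fix_roman_numerals('Charles H. of Spain'))
print(fix_roman_numerals('J. H. Newman'))
```

The last call shows a false positive: the initial "H." in a personal name is also rewritten to "II.", which is the kind of case the "check each instance" setting is there to catch.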
Fix life dates in entry terms
\s*\(c\.<\/hi>
</hi> (<hi rend="italic">c.</hi>
regex
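As a worked example of the life-dates rule, the sketch below moves "(c." out of the entry term's <hi> and gives "c." its own italic <hi>. It assumes Python's `re` module; `fix_life_dates` and the sample entry are hypothetical.

```python
import re

def fix_life_dates(text):
    """Close the entry-term <hi> before "(c." and italicize "c." separately."""
    return re.sub(r'\s*\(c\.<\/hi>', '</hi> (<hi rend="italic">c.</hi>', text)

print(fix_life_dates('<p><hi rend="bold">BACON, ROGER (c.</hi> 1214'))
```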