Cleanup Page Files

Fix the most common OCR errors in the TEI-page files.

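Run the search-and-replace operations below on batches of page files. Each row gives the search pattern, its replacement, and the required settings; rows with an empty replacement call for manual correction.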
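
The batch workflow could be scripted roughly as follows. This is a minimal sketch, not the project's actual tooling: it assumes Python's `re` engine, page files ending in `.xml`, and that rows without a "case sensitive" setting are matched case-insensitively. The rule tuple format and the names `RULES`, `apply_rules`, and `clean_batch` are hypothetical; the two sample rules are copied from this page.

```python
import re
from pathlib import Path

# Hypothetical rule format: (search, replace, is_regex, case_sensitive).
# Both sample rows are copied from the table on this page.
RULES = [
    ("fécond", "second", False, True),
    (r"(<p>[A-ZÀ-Ù]+)\sA\s?([A-ZÀ-Ù])", r"\1A\2", True, True),
]

def apply_rules(text, rules=RULES):
    """Apply every rule in order and return the corrected text."""
    for search, replace, is_regex, case_sensitive in rules:
        # Assumption: a row without "case sensitive" is case-insensitive.
        flags = 0 if case_sensitive else re.IGNORECASE
        if is_regex:
            text = re.sub(search, replace, text, flags=flags)
        else:
            # Literal search: escape the pattern, insert the replacement verbatim.
            text = re.sub(re.escape(search), lambda m, r=replace: r, text, flags=flags)
    return text

def clean_batch(folder, glob="*.xml"):
    """Rewrite each matching file in place; return the names of changed files."""
    changed = []
    for path in sorted(Path(folder).glob(glob)):
        original = path.read_text(encoding="utf-8")
        cleaned = apply_rules(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed.append(path.name)
    return changed
```

Because `clean_batch` overwrites files, it is safer to run it on a copy of the page files or under version control, especially for rows marked "check each instance".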


search	replace	settings
Clean up DOCX encoding
Remove all variants of @rend="CharStylenn" and @xml:space="preserve" from <hi> and <seg>
`<hi\srend=\"CharStyle\d{1,2}\"\s(xml\:space=\"preserve\")?\/?>(.*?)?<\/hi>`	`\2`	regex
Replace all <seg> elements with <hi>
`<seg(\s)(rend\=\"[a-z\s]+\")\s>(.*?)<\/seg>`	`<hi\1\2>\3</hi>`	regex
Replace OCR errors in entry terms: rerun this set after the first run, because each rule corrects at most one character per term in a single pass
`(<p>[A-ZÀ-Ù]+)[1lſ]([A-ZÀ-Ù])`	`\1I\2`	case sensitive; regex
`(<p>[A-ZÀ-Ù]+)[0θ]([A-ZÀ-Ù])`	`\1O\2`	case sensitive; regex
`(<p>[A-ZÀ-Ù]+)Ľ([A-ZÀ-Ù])`	`\1E\2`	case sensitive; regex
`(<p>[A-ZÀ-Ù]+)Λ([A-ZÀ-Ù])`	`\1A\2`	case sensitive; regex
`(<p>[A-ZÀ-Ù]+)Γ([A-ZÀ-Ù])`		case sensitive; regex; manually correct
Find hidden entries
For consonants, remove unneeded characters between `<p>` and the entry term (sample uses "B" for the variable letter)
`(<p>)[^A-ZÀ-Ù]{1,3}(B[A-ZÀ-Ù]+)`	`\1\2`	case sensitive; regex; check each instance
For vowels, remove unneeded characters between `<p>` and the entry term (sample uses "A" and its accented forms)
`<p>[^A-Z]{1,3}([AÀ-ÆĀĂĄ][A-ZÀ-Ý]+)`	`<p>\1`	case sensitive; regex; check each instance
Identify entry terms that are not on a new line (sample uses "B")
`([\s\.])(B[A-ZÀ-Ù]{2,})`	`\1</p><p>\2`	case sensitive; regex; check each instance
Find entry terms with a space or non-Roman character after the first letter (sample initial letter is "B")
`(<p>B)[\s\W]([A-ZÀ-Ù])`	`\1\2`	case sensitive; regex; check each instance
Prevent terms in cells beginning with the entry letter from being mistaken for entries
`(<cell><p>)(B[A-Z].*?)(\<\/p>)`	`\1<hi>\2</hi>\3`	case sensitive; regex
Remove extra space around "A" in entry terms
`(<p>[A-ZÀ-Ù]+)\sA\s?([A-ZÀ-Ù])`	`\1A\2`	case sensitive; regex
`(<p>[A-ZÀ-Ù]+)\s?A\s([A-ZÀ-Ù])`	`\1A\2`	case sensitive; regex
Clean up footnote @@@s
Remove dirt preceding @@@ at the beginning of a line
`<p>[\s\W]+(<(?:\w|\"|=|\(|\)|\s)+>?@{3})`	`<p>\1`	regex; check each instance
Fix instances of multiple @@@s on the same line
`([\.\"?:])\s(<(?:\w|\"|=|\(|\)|\s)+>)?@{3}`	`\1</p><p>\2@@@`	regex; check each instance
Find @@ after @@@
`(@{3})([\w\s<>.,;\-—&/=\"]*)(?<!@)(@{2}[^@])`		regex; manual cleanup
Remove `<hi>` encoding around @@@s
`<hi rend\=\"\w\">(@@@[\[mu\]]?)(\s?)<\/hi>`	`\1\2`	regex
For eb03, find @@@ + note siglum inside parentheses and move @@@ outside parentheses, i.e. (@@@a) ==> @@@(a)
`(<p>)(<hi rend=\"smallcaps\">)?\(\s*(</hi>)?(<hi rend=\"smallcaps\">)?(@{3,})(</hi>)?(<hi rend=\"[a-z]+\">)?(.)\)?(</hi>)?\)?`	`\1\5(<hi rend="smallcaps">\8</hi>)`	regex (single line)
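
The entry-term OCR rules in this table (the rows that turn `1`, `l`, `ſ` into `I`, `0`, `θ` into `O`, and so on) are meant to be rerun after the first pass. The sketch below illustrates why, assuming Python's `re` engine: every pattern is anchored on `<p>`, so a single global replace can fix only one character per term for each rule. Looping until nothing changes is a generalization of the "rerun" instruction, and the names `run_once` and `run_until_stable` are hypothetical.

```python
import re

# The four automatic entry-term rules, copied from the table above.
ENTRY_TERM_RULES = [
    (r"(<p>[A-ZÀ-Ù]+)[1lſ]([A-ZÀ-Ù])", r"\1I\2"),
    (r"(<p>[A-ZÀ-Ù]+)[0θ]([A-ZÀ-Ù])", r"\1O\2"),
    (r"(<p>[A-ZÀ-Ù]+)Ľ([A-ZÀ-Ù])", r"\1E\2"),
    (r"(<p>[A-ZÀ-Ù]+)Λ([A-ZÀ-Ù])", r"\1A\2"),
]

def run_once(text):
    """Apply each rule once, in table order (case sensitive by default)."""
    for pattern, replacement in ENTRY_TERM_RULES:
        text = re.sub(pattern, replacement, text)
    return text

def run_until_stable(text, max_passes=10):
    """Repeat the rule set until a pass changes nothing.

    Returns the final text and the number of passes that made a change.
    """
    for completed in range(max_passes):
        updated = run_once(text)
        if updated == text:
            return text, completed
        text = updated
    return text, max_passes

sample = "<p>AB1L1TY, power to act."
first_pass = run_once(sample)            # only the first "1" is corrected
final, passes = run_until_stable(sample)
```

Here the first pass leaves `<p>ABIL1TY`, and the second pass finishes the term, which matches the instruction to rerun the set.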

Correct text errors
Replace fécond with second
`fécond`	`second`	case sensitive
Replace copyright and registered trademark symbols with o
`©|®`	`o`	case-sensitive; regex
Remove dirt after the letter w
`w(\s<hi rend=\"superscript\">\s(,|τ|j|r|f|i|T|7)\s</hi>)(\s)`	`w`	case-sensitive; regex
Fix legal citations (eb07, eb09, eb11 only)
`Viet.`	`Vict.`	case-sensitive
Correct roman numerals
`\b([IVXCM])L\b([^\.])`	`\1I.\2`	case-sensitive; regex; check each instance
`([IVXLCM\s])[HΠ]([IVXLCM\.])`	`\1II\2`	case-sensitive; regex; check each instance
Fix life dates in entry terms
`\s*\(c\.<\/hi>`	`</hi> (<hi rend="italic">c.</hi>`	regex
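
Several rows on this page are marked "check each instance". One way to do that outside an editor is to list every match with its proposed replacement before changing anything. The sketch below assumes Python's `re` engine; the `preview` helper and the sample sentence are hypothetical, and the pattern is the second roman-numeral rule from the table.

```python
import re

def preview(pattern, replacement, text, context=15):
    """Return (snippet, matched text, proposed text) for each match, without editing."""
    results = []
    for m in re.finditer(pattern, text):
        start = max(m.start() - context, 0)
        end = min(m.end() + context, len(text))
        # Match.expand fills \1, \2 ... in the replacement template for this match.
        results.append((text[start:end], m.group(0), m.expand(replacement)))
    return results

ROMAN_II = r"([IVXLCM\s])[HΠ]([IVXLCM\.])"

sample = "Henry VH. sailed on the HMS Victory."
for snippet, found, proposed in preview(ROMAN_II, r"\1II\2", sample):
    print(f"{found!r} -> {proposed!r}   in: ...{snippet}...")
```

In this sample the rule correctly proposes `VII.` for `VH.`, but it also matches ` HM` in "HMS", which is the kind of false positive that makes per-instance review necessary.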