Clean Up Page Files
Fix the most common OCR errors in the TEI-page files.
Run search and replace on batches of page files to correct common OCR errors.
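
The batch step can also be scripted. The following is a minimal sketch only, not the project's actual tooling: it assumes the page files are UTF-8 `*.xml` files in a single directory, and the names `LITERAL_RULES`, `apply_literal_rules`, and `clean_batch` are hypothetical. It uses two of the literal, case-sensitive replacements listed in the table below.

```python
from pathlib import Path

# Hypothetical rule list: (search, replace) pairs copied from the literal,
# case-sensitive rows of the table. The "Viet." rule is listed there for
# eb07, eb09, and eb11 only, so drop it for other editions.
LITERAL_RULES = [
    ("fécond", "second"),
    ("Viet.", "Vict."),
]


def apply_literal_rules(text, rules=LITERAL_RULES):
    """Apply each literal search/replace pair in order and return the result."""
    for search, replace in rules:
        text = text.replace(search, replace)
    return text


def clean_batch(directory, pattern="*.xml"):
    """Rewrite every matching page file in place.

    Returns the names of the files that were actually changed, so the
    batch can be reviewed afterwards.
    """
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        original = path.read_text(encoding="utf-8")
        cleaned = apply_literal_rules(original)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed.append(path.name)
    return changed
```

Rules marked "check each instance" or "manually correct" in the table are not good candidates for an unattended loop like this one; they are better run interactively.
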
search | replace | settings |
---|---|---|
Clean up DOCX encoding | ||
Remove all variants of @rend="CharStylenn" @xml:space="preserve"> from <hi> and <seg> | ||
|
|
regex |
Replace all <seg> elements with <hi> | ||
|
|
regex |
Replace OCR errors in entry terms: rerun this set after first run | ||
(<p>[A-ZÀ-Ù]+)[1lſ]([A-ZÀ-Ù]) |
\1I\2 |
case sensitive; regex |
(<p>[A-ZÀ-Ù]+)[0θ]([A-ZÀ-Ù]) |
\1O\2 |
case sensitive; regex |
(<p>[A-ZÀ-Ù]+)Ľ([A-ZÀ-Ù]) |
\1E\2 |
case sensitive; regex |
|
|
case sensitive; regex |
(<p>[A-ZÀ-Ù]+)Γ([A-ZÀ-Ù]) |
case sensitive; regex; manually correct | |
Find hidden entries | ||
For consonants, remove unneeded characters between <p> and the entry term (sample uses "B" for the variable letter) |
||
(<p>)[^A-ZÀ-Ù]{1,3}(B[A-ZÀ-Ù]+) |
\1\2 |
case sensitive; regex; check each instance |
For vowels, remove unneeded characters between <p> and entry terms (sample uses "A" and its accented forms) |
||
|
|
case sensitive; regex; check each instance |
Identify entry terms that are not on a new line (sample uses "B") | ||
([\s\.])(B[A-ZÀ-Ù]{2,}) |
\1 |
case sensitive; regex; check each instance |
Find entry terms with a space or non-Roman character after the first term (sample initial letter is "B") | ||
( |
\1\2 |
case sensitive; regex; check each instance |
Prevent terms in cells beginning with the entry letter from being mistaken for entries | ||
|
|
case sensitive; regex |
Remove extra space around "A" in entry terms | ||
|
|
case sensitive; regex |
|
|
case sensitive; regex |
Clean up footnote @@@s | ||
Remove dirt preceding @@@ at the beginning of a line | ||
|
|
regex; check each instance |
Fix instances of multiple @@@s on the same line | ||
|
|
regex; check each instance |
Find @@ after @@@ | ||
|
regex; manual cleanup | |
Remove encoding around @@@s |
||
<hi rend\=\"\w*\">(@@@[\[mu\]]*?)(\s?)<\/hi> |
\1\2 |
regex |
For eb03, find @@@ + note siglum inside parentheses and move @@@ outside parentheses, i.e. (@@@a) ==> @@@(a) | ||
( (@{3,})( |
\1\5(hi rend="smallcaps"\8 |
regex (single line) |
Correct text errors | ||
Replace fécond with second | ||
fécond |
second |
case sensitive |
Replace copyright and registered trademark symbols with o | ||
©|® |
o |
case-sensitive; regex |
Remove dirt after the letter w | ||
w(\s*<hi rend=\"superscript\">\s*(,|τ|j|r|f|i|T|7)\s*</hi>)(\s*) |
w |
case-sensitive; regex |
Fix legal citations (eb07, eb09, eb11 only) | ||
Viet. |
Vict. |
case-sensitive |
Correct roman numerals | ||
\b([IVXCM])L\b([^\.]) |
\1I.\2 |
case-sensitive; regex; check each instance |
|
|
case-sensitive; regex; check each instance |
Fix life dates in entry terms | ||
|
|
regex |
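
The note "rerun this set after first run" on the entry-term rules can be illustrated with a short sketch. It assumes Python's `re` syntax, which accepts the three patterns and `\1`/`\2` replacements exactly as written in the table; the function name `fix_entry_terms` and the `max_passes` limit are hypothetical. Each pattern is anchored on `<p>`, so one pass can correct only one misread character per entry term. The sketch repeats the set until nothing changes, which generalizes the single rerun the table asks for.

```python
import re

# Search/replace pairs copied from the "OCR errors in entry terms" rows.
# Python's re module is case sensitive by default, matching the
# "case sensitive; regex" setting in the table.
ENTRY_TERM_RULES = [
    (r"(<p>[A-ZÀ-Ù]+)[1lſ]([A-ZÀ-Ù])", r"\1I\2"),
    (r"(<p>[A-ZÀ-Ù]+)[0θ]([A-ZÀ-Ù])", r"\1O\2"),
    (r"(<p>[A-ZÀ-Ù]+)Ľ([A-ZÀ-Ù])", r"\1E\2"),
]


def fix_entry_terms(text, max_passes=10):
    """Apply the entry-term rules repeatedly until the text stops changing.

    max_passes is an assumed safety limit, not a value from the source.
    """
    for _ in range(max_passes):
        previous = text
        for pattern, replacement in ENTRY_TERM_RULES:
            text = re.sub(pattern, replacement, text)
        if text == previous:
            break
    return text


# The first pass fixes only the first "1"; the second pass fixes the other.
print(fix_entry_terms("<p>B1RM1NGHAM, a town"))
```

Text that does not start with `<p>` followed by capitals is left alone, so lowercase running text containing "l" or "0" is not touched by these three rules.
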