Cleanup Page Files
Fix the most common OCR errors in the TEI-page files.
Run search and replace on batches of page files to correct common OCR errors.
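A batch run of this kind can be sketched in Python. This is only an illustration, not the tool used for this project: the names `RULES`, `apply_rules`, and `clean_batch`, and the `*.xml` file pattern, are assumptions. The single sample rule comes from the "Replace fécond with second" row in the table.

```python
import re
from pathlib import Path

# Each rule is (pattern, replacement, flags). Hypothetical sample rule only;
# matching is case sensitive because no re.IGNORECASE flag is passed.
RULES = [
    (re.escape("fécond"), "second", 0),
]

def apply_rules(text, rules):
    """Apply each rule in order; return (new_text, total_replacements)."""
    total = 0
    for pattern, replacement, flags in rules:
        text, hits = re.subn(pattern, replacement, text, flags=flags)
        total += hits
    return text, total

def clean_batch(folder, rules, glob="*.xml"):
    """Rewrite every matching file in folder; return {filename: hits} for changed files."""
    report = {}
    for path in sorted(Path(folder).glob(glob)):
        original = path.read_text(encoding="utf-8")
        cleaned, hits = apply_rules(original, rules)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            report[path.name] = hits
    return report
```

Rules marked "check each instance" or "manually correct" in the table should not be run unattended this way. For those, the per-file hit counts are only a guide to where to look.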
search | replace | settings |
---|---|---|
Clean up DOCX encoding | ||
Remove all variants of @rend="CharStylenn" @xml:space="preserve"> from <hi> and <seg> | ||
|
|
regex |
Replace all <seg> elements with <hi> | ||
|
|
regex |
Replace OCR errors in entry terms: rerun this set after first run | ||
|
|
case sensitive; regex |
|
|
case sensitive; regex |
|
|
case sensitive; regex |
|
|
case sensitive; regex |
|
case sensitive; regex; manually correct | |
Find hidden entries | ||
For consonants, remove unneeded characters between <p> and the entry term (sample uses "B" for the variable letter) | ||
|
|
case sensitive; regex; check each instance |
For vowels, remove unneeded characters between <p> and entry terms (sample uses "A" and its accented forms) | ||
|
|
case sensitive; regex; check each instance |
Identify entry terms that are not on a new line (sample uses "B") | ||
|
|
case sensitive; regex; check each instance |
Find entry terms with a space or non-Roman character after the first term (sample initial letter is "B") | ||
|
|
case sensitive; regex; check each instance |
Prevent terms in cells beginning with the entry letter from being mistaken for entries | ||
|
|
case sensitive; regex |
Remove extra space around "A" in entry terms | ||
|
|
case sensitive; regex |
|
|
case sensitive; regex |
Clean up footnote @@@s | ||
Remove dirt preceding @@@ at the beginning of a line | ||
|
|
regex; check each instance |
Fix instances of multiple @@@s on the same line | ||
|
|
regex; check each instance |
Find @@ after @@@ | ||
|
regex; manual cleanup | |
Remove encoding around @@@s | ||
|
|
regex |
For eb03, find @@@ + note siglum inside parentheses and move @@@ outside parentheses, i.e. (@@@a) ==> @@@(a) | ||
|
|
regex (single line) |
Correct text errors | ||
Replace fécond with second | ||
|
|
case sensitive |
Replace copyright and registered trademark symbols with o | ||
|
|
case sensitive; regex |
Remove dirt after the letter w | ||
|
|
case sensitive; regex |
Fix legal citations (eb07, eb09, eb11 only) | ||
|
|
case sensitive |
Correct roman numerals | ||
|
|
case sensitive; regex; check each instance |
|
|
case sensitive; regex; check each instance |
Fix life dates in entry terms | ||
|
|
regex |
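
A few of the rules above are described precisely enough to sketch as regular expressions. The patterns below are reconstructed from the row descriptions only and are not the expressions originally used. They assume that "nn" in CharStylenn stands for digits and that a note siglum is a single lowercase letter. `SAMPLE_RULES` and `run_sample_rules` are hypothetical names.

```python
import re

# Hypothetical patterns inferred from the rule descriptions; verify before use.
SAMPLE_RULES = [
    # Strip rend="CharStyleNN" and xml:space="preserve" from <hi> and <seg>
    # (NN assumed to be one or more digits).
    (r'(<(?:hi|seg))(?:\s+(?:rend="CharStyle\d+"|xml:space="preserve"))+\s*>', r'\1>'),
    # Rename <seg> and </seg> to <hi> and </hi>, keeping any attributes.
    (r'<(/?)seg\b', r'<\1hi'),
    # Move @@@ outside the parentheses: (@@@a) becomes @@@(a).
    (r'\(@@@([a-z])\)', r'@@@(\1)'),
    # Replace copyright and registered trademark symbols with the letter o.
    (r'[©®]', 'o'),
]

def run_sample_rules(text):
    """Apply the sample rules in table order and return the cleaned text."""
    for pattern, replacement in SAMPLE_RULES:
        text = re.sub(pattern, replacement, text)
    return text
```

The order matters: the attribute cleanup runs before the <seg> to <hi> rename, as in the table, so both element names still need to be matched in the first pattern.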