Cleanup Page Files
Run the search-and-replace sets below on batches of TEI page files to correct the most common OCR errors.
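A batch run of this kind could be sketched in Python as follows. This is a minimal illustration, not the project's actual tooling: the `Rule` tuple, the `apply_rules` and `clean_directory` helpers, and the `*.xml` file pattern are all assumptions introduced here. The `regex` and `case_sensitive` fields mirror the options listed in the settings column.

```python
import re
from pathlib import Path
from typing import NamedTuple


class Rule(NamedTuple):
    """One row of a search-and-replace table (hypothetical structure)."""
    search: str
    replace: str
    regex: bool = True
    case_sensitive: bool = True


def apply_rules(text: str, rules: list[Rule]) -> str:
    """Apply each rule in order and return the corrected text."""
    for rule in rules:
        flags = 0 if rule.case_sensitive else re.IGNORECASE
        if rule.regex:
            text = re.sub(rule.search, rule.replace, text, flags=flags)
        else:
            # Literal rule: escape the search string and insert the
            # replacement verbatim (no backreference expansion).
            text = re.sub(re.escape(rule.search),
                          lambda _m, r=rule.replace: r,
                          text, flags=flags)
    return text


def clean_directory(folder: Path, rules: list[Rule], glob: str = "*.xml") -> int:
    """Rewrite every matching file in place; return how many files changed."""
    changed = 0
    for path in sorted(folder.glob(glob)):
        original = path.read_text(encoding="utf-8")
        cleaned = apply_rules(original, rules)
        if cleaned != original:
            path.write_text(cleaned, encoding="utf-8")
            changed += 1
    return changed
```

For example, `clean_directory(Path("pages"), [Rule("fécond", "second", regex=False)])` would apply the "fécond" correction to every page file in a hypothetical `pages` folder. Rules marked "check each instance" or "manually correct" are better run interactively than through an unattended batch like this one.
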
| search | replace | settings |
|---|---|---|
| Clean up DOCX encoding | ||
| Remove all variants of @rend="CharStylenn" @xml:space="preserve"> from <hi> and <seg> | ||
|
|
regex |
| Replace all <seg> elements with <hi> | ||
|
|
regex |
| Replace OCR errors in entry terms (rerun this set after the first run) | ||
|
|
case sensitive; regex |
|
|
case sensitive; regex |
|
|
case sensitive; regex |
|
|
case sensitive; regex |
|
case sensitive; regex; manually correct | |
| Find hidden entries | ||
| For consonants, remove unneeded characters between <p> and the entry term (sample uses "B" for the variable letter) | ||
|
|
case sensitive; regex; check each instance |
| For vowels, remove unneeded characters between <p> and entry terms (sample uses "A" and its accented forms) | ||
|
|
case sensitive; regex; check each instance |
| Identify entry terms that are not on a new line (sample uses "B") | ||
|
|
case sensitive; regex; check each instance |
| Find entry terms with a space or non-Roman character after the first term (sample initial letter is "B") | ||
|
|
case sensitive; regex; check each instance |
| Prevent terms in cells beginning with the entry letter from being mistaken for entries | ||
|
|
case sensitive; regex |
| Remove extra space around "A" in entry terms | ||
|
|
case sensitive; regex |
|
|
case sensitive; regex |
| Clean up footnote @@@s | ||
| Remove dirt preceding @@@ at the beginning of a line | ||
|
|
regex; check each instance |
| Fix instances of multiple @@@s on the same line | ||
|
|
regex; check each instance |
| Find @@ after @@@ | ||
|
regex; manual cleanup | |
| Remove encoding around @@@s | ||
|
|
regex |
| For eb03, find @@@ + note siglum inside parentheses and move @@@ outside the parentheses, e.g. (@@@a) ==> @@@(a) | ||
|
|
regex (single line) |
| Correct text errors | ||
| Replace fécond with second | ||
|
|
case sensitive |
| Replace copyright (©) and registered trademark (®) symbols with the letter o | ||
|
|
case sensitive; regex |
| Remove dirt after the letter w | ||
|
|
case sensitive; regex |
| Fix legal citations (eb07, eb09, eb11 only) | ||
|
|
case sensitive |
| Correct Roman numerals | ||
|
|
case sensitive; regex; check each instance |
|
|
case sensitive; regex; check each instance |
| Fix life dates in entry terms | ||
|
|
regex |
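
The intent of a few rows can be sketched with Python's `re` module. The patterns below are illustrative assumptions, not the project's actual search expressions, and the function names are hypothetical. In particular, the note siglum is assumed to be a single lowercase letter, and `<seg>` is assumed never to need anything beyond a tag rename.

```python
import re


def seg_to_hi(text: str) -> str:
    """Rename <seg> elements to <hi>, keeping any attributes."""
    # The lookahead keeps longer tag names such as <segment> untouched.
    text = re.sub(r"<seg(?=[\s>/])", "<hi", text)
    return text.replace("</seg>", "</hi>")


def move_note_marker(text: str) -> str:
    """Move a footnote marker out of parentheses: (@@@a) becomes @@@(a)."""
    return re.sub(r"\(@@@([a-z])\)", r"@@@(\1)", text)


def symbols_to_o(text: str) -> str:
    """Replace copyright and registered trademark symbols with 'o'."""
    return re.sub("[\u00a9\u00ae]", "o", text)
```

Each function takes the text of one page file and returns the corrected text, so the three can be chained or tried on a sample page before being turned into batch rules.
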
