Named Entity Recognition
Running named entity recognition on entries
- Create a batch folder for the group of files being processed. Place it in the outputs\metadata directory, using the same alphabetical folder tree as the entry files, for example:
  outputs\metadata\eb03\a\
- In Oxygen XML Editor, use tei-to-text.xsl to create TXT versions of the XML entry files. Keep the same base filename. Output the TXT files to the batch folder. We will use these to generate our NER data and to process it in HIVE2.
- To generate the NER data, use Stanza, the Stanford NLP Group’s current natural language processing toolkit for Python, which includes Stanford NER. Run Stanza on the TXT batch using two different NER subsets:
  - General entities (NER Topics). Save results in the batch folder as CSV files, with the letter "a" appended to the base filename.
  - Locations (NER Geo). Save results in the batch folder as CSV files, with the letter "b" appended to the base filename.
- Each entry will now have three files in the batch folder:
  - full text: kp*.txt
  - NER Topics: kp*a.csv
  - NER Geo: kp*b.csv
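
The Stanza step and the file naming above can be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual script. It assumes the entries are in English, that Stanza's default English NER model (which uses OntoNotes-style labels such as GPE and LOC) is in use, that "Locations" corresponds to the GPE, LOC, and FAC labels, and that "General entities" means every other label. The function names, the GEO_TYPES set, and the two-column CSV layout are all hypothetical.

```python
import csv
from pathlib import Path

# Assumption: labels treated as locations for the NER Geo subset.
GEO_TYPES = {"GPE", "LOC", "FAC"}


def output_paths(txt_path):
    """Return the (NER Topics, NER Geo) CSV paths for one TXT entry file."""
    txt_path = Path(txt_path)
    return (
        txt_path.with_name(txt_path.stem + "a.csv"),  # NER Topics
        txt_path.with_name(txt_path.stem + "b.csv"),  # NER Geo
    )


def split_entities(entities):
    """Split (text, label) pairs into general entities and locations."""
    topics = [e for e in entities if e[1] not in GEO_TYPES]
    geo = [e for e in entities if e[1] in GEO_TYPES]
    return topics, geo


def write_csv(path, rows):
    """Write (text, label) rows to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["entity", "type"])  # hypothetical column names
        writer.writerows(rows)


def run_batch(batch_dir):
    """Run Stanza NER on every TXT file in the batch folder."""
    # Imported here so the helpers above can be used without Stanza installed.
    import stanza

    # The English models must already be downloaded, e.g. stanza.download("en").
    nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")
    for txt_path in sorted(Path(batch_dir).glob("*.txt")):
        doc = nlp(txt_path.read_text(encoding="utf-8"))
        entities = [(ent.text, ent.type) for ent in doc.ents]
        topics, geo = split_entities(entities)
        topics_csv, geo_csv = output_paths(txt_path)
        write_csv(topics_csv, topics)
        write_csv(geo_csv, geo)
```

With Stanza installed and the models downloaded, calling run_batch(r"outputs\metadata\eb03\a") would write the two CSV files next to each TXT file in that folder. If the project's two subsets are defined differently, only GEO_TYPES and split_entities would need to change.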