Named Entity Recognition

Running named entity recognition on entries

  1. In the outputs\metadata directory, create a batch folder for the group of files being processed, using the same alphabetical folder tree as the entry files. For example:
    outputs\metadata\eb03\a\
  2. In Oxygen XML Editor, use tei-to-text.xsl to create TXT versions of the XML entry files. Keep the same base filename. Output the TXT files to the batch folder. We will use these to generate our NER data and to process it in HIVE2.
  3. To generate the NER data, use Stanza, the Stanford NLP Group’s current natural language processing toolkit for Python, which includes Stanford NER. Run Stanza on the TXT batch using two different NER subsets:
    1. General entities (NER Topics). Save results in the batch folder as CSV files, with a appended to the base filename.
    2. Locations (NER Geo). Save results in the batch folder as CSV files, with b appended to the base filename.
  4. Each entry will now have three files in the batch folder:
    Full text:   kp*.txt
    NER Topics:  kp*a.csv
    NER Geo:     kp*b.csv
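
Steps 3 and 4 can be sketched in Python roughly as follows. This is a minimal illustration, not the project's actual script. It assumes Stanza's English models are installed, that "NER Topics" means every entity Stanza returns, and that "NER Geo" means entities labeled GPE, LOC, or FAC. The function names (output_paths, write_csv, run_batch) and the two CSV columns are hypothetical.

```python
import csv
from pathlib import Path

# Assumption: these OntoNotes-style labels count as locations for NER Geo.
GEO_TYPES = {"GPE", "LOC", "FAC"}


def output_paths(txt_path):
    """Return the (NER Topics, NER Geo) CSV paths for one entry TXT file.

    The CSVs sit next to the TXT file and reuse its base filename,
    with "a" or "b" appended before the extension.
    """
    txt_path = Path(txt_path)
    topics = txt_path.with_name(txt_path.stem + "a.csv")
    geo = txt_path.with_name(txt_path.stem + "b.csv")
    return topics, geo


def write_csv(path, rows):
    """Write (entity text, entity type) rows to a CSV file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "type"])  # hypothetical column names
        writer.writerows(rows)


def run_batch(batch_dir):
    """Run Stanza NER on every TXT file in the batch folder."""
    # Imported here so the helpers above work without Stanza installed.
    import stanza

    # Assumes the English models were fetched once with stanza.download("en").
    nlp = stanza.Pipeline(lang="en", processors="tokenize,ner")

    for txt_path in sorted(Path(batch_dir).glob("*.txt")):
        doc = nlp(txt_path.read_text(encoding="utf-8"))
        entities = [(ent.text, ent.type) for ent in doc.ents]
        topics_path, geo_path = output_paths(txt_path)
        # NER Topics: all entities (assumption).
        write_csv(topics_path, entities)
        # NER Geo: location-type entities only (assumption).
        write_csv(geo_path, [e for e in entities if e[1] in GEO_TYPES])
```

For the example batch folder above, the call would be run_batch(Path("outputs") / "metadata" / "eb03" / "a"). Afterwards each entry should have its TXT file plus the two CSV files alongside it.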