A brief history of the Knowledge Project
Early editions
In the history of Britannica, many of the early editions consisted primarily of reprints from previous editions with a smattering of newer material.1 While eleven editions were published between 1771 and 1911, only four were dominated by original essays reflecting the latest developments in their fields. These are the four editions selected for the Knowledge Project: the 3rd, 7th, 9th, and 11th.
Fortunately, high-resolution reproductions of these four editions are readily available from the Internet Archive, HathiTrust, and, since 2020, the National Library of Scotland. Thus there was no need to create new images.
Text mining works with transcriptions of the page images rather than the images themselves. Researchers use optical character recognition (OCR) to extract textual data from images, creating a text layer that can be associated with each image. In examining the available text layers, we found that they all suffered from relatively low accuracy. The cause was Britannica's complex two-column layout, which placed glosses in both margins and at the bottom of pages; many pages also had illustrations interspersed with text. Because of the size of each edition, the OCR process had to be automated, and an automated process could not adapt to the difficulties of individual pages. The resulting text was too inaccurate to produce valid results with existing text mining methods.
Years of OCR work
What began as a text-mining research project turned into six years of dedicated OCR work to create text accurate enough to meet the needs of scholars engaged in text mining analysis.2 In the end, we achieved an accuracy rate above 99.5% for all editions except the earliest, the 3rd, where we achieved a rate above 99.2%. To put these rates into perspective, a rate of 90% sounds excellent, but it means that one out of every 10 words has an error in it. At that rate, a typical sentence contains one or more errors. By contrast, 99.5% means there is still one error in every 200 words, so perhaps one per paragraph. While not perfect, text at this level will yield valid results with text mining methods.
Ideally, we would proofread every page to generate "clean" OCR data. But with 86,531 pages averaging 1,273 words each, the data set totals over 110 million words, making proofreading expensive and impractical. Simply reading it as quickly as possible would take one person an estimated seven to eight years.
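The arithmetic behind these figures can be checked in a few lines. The Python sketch below uses only the page count, the average words per page, and the accuracy rates quoted above. The function name words_per_error is ours, and the sketch assumes that accuracy is measured per word.

```python
PAGES = 86_531          # pages across the four editions (from the text)
WORDS_PER_PAGE = 1_273  # average words per page (from the text)

total_words = PAGES * WORDS_PER_PAGE


def words_per_error(accuracy: float) -> int:
    """Average number of words between errors at a given word-level accuracy."""
    return round(1 / (1 - accuracy))


print(f"total words: {total_words:,}")
for accuracy in (0.90, 0.992, 0.995):
    print(f"{accuracy:.1%} accurate: one error per {words_per_error(accuracy)} words")
```

The product comes to 110,153,963 words, which matches the "over 110 million" figure.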
Encoding as TEI
With the OCR work complete, all of that material had to be converted into TEI for preservation and enrichment. TEI (Text Encoding Initiative) is a set of standards designed for encoding textual data. While few sites will display TEI directly, its strength is that it converts easily into any desired output format, whether EPUB, HTML, PDF, or many others. It thus serves as a perfect archival format.
Due to the scale of the material, we automated this procedure. The OCR software produced Microsoft Word (DOCX) documents, which we converted directly into TEI using an XSLT script. We then wrote a Python script to combine the TEI for individual pages into one large continuous document, which could be subdivided at the beginning of each encyclopedia entry. Over the years we developed additional Python scripts to refine the files by correcting common OCR errors.
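The combine-then-split step can be illustrated with a minimal Python sketch. It is not the project's script: it works on plain text rather than TEI, and the rule it uses to detect the start of an entry (a headword in capitals followed by a comma) is an assumption made for illustration. The function name split_entries and the sample text are likewise invented.

```python
import re

# Assumed heuristic (not from the project): an entry opens with a headword
# in capital letters followed by a comma, e.g. "ABACUS, ...".
HEADWORD = re.compile(r"^([A-Z][A-Z' -]*),")


def split_entries(pages):
    """Join page texts in order, then start a new entry at each headword line."""
    entries = []
    for page in pages:
        for line in page.splitlines():
            match = HEADWORD.match(line)
            if match:
                entries.append({"headword": match.group(1), "lines": [line]})
            elif entries:
                entries[-1]["lines"].append(line)
            else:
                # Text before the first headword is kept as front matter.
                entries.append({"headword": None, "lines": [line]})
    return entries


# Invented sample pages; the second entry runs across the page break.
pages = [
    "ABACUS, a counting frame\nused for arithmetic.\nABBEY, a monastery",
    "governed by an abbot.\nABDOMEN, the belly.",
]
for entry in split_entries(pages):
    print(entry["headword"], len(entry["lines"]))
```

Because the pages are joined before splitting, an entry that begins on one page and ends on the next stays in one piece, which is the reason for building the continuous document first.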
Long-S (ſ)
The 3rd ed. is the only one of the four to include the 'long-s' mark, ſ. Converting this to the modern 's' would be simple if OCR recognized it perfectly in old books, but too often 'ſ' is mistaken for 'f' (or 'l' or several others). Complicating the problem further is the occurrence of words in English like 'fat' and 'sat,' where only a look at context can determine which is the intended spelling for 'ſat.'
Don Kretz deserves great credit for devising an original solution to the problem. He compared the frequency of words occurring in the 3rd ed. with those in the 7th, where there is no long-s. He then used the word frequencies from the 7th edition to estimate the most likely reading of each word containing a long-s in the 3rd. If the probability reached a threshold value, his program automatically made the change.
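The general idea can be sketched in Python. This is not Don Kretz's program, whose details are not given here. It is a simplified illustration that assumes each 'ſ' or 'f' in a token might stand for either 's' or 'f', takes a candidate's share of the reference counts as its probability, and uses an arbitrary threshold of 0.9. The names candidates and resolve and the word counts are invented.

```python
from itertools import product

# Characters treated as ambiguous: a recognized long-s (U+017F), or an 'f'
# that may be a misread long-s. (Assumption; other confusions are ignored.)
AMBIGUOUS = {"\u017f", "f"}


def candidates(token):
    """All spellings obtained by reading each ambiguous character as 's' or 'f'."""
    slots = [("s", "f") if ch in AMBIGUOUS else (ch,) for ch in token]
    return ["".join(chars) for chars in product(*slots)]


def resolve(token, reference_counts, threshold=0.9):
    """Return the likeliest modern spelling, or None if no candidate is clear."""
    counts = {word: reference_counts.get(word, 0) for word in candidates(token)}
    total = sum(counts.values())
    if total == 0:
        return None  # no candidate appears in the reference edition
    best = max(counts, key=counts.get)
    return best if counts[best] / total >= threshold else None


# Made-up counts standing in for word frequencies from the 7th edition.
reference = {"sat": 40, "fat": 60, "also": 500, "himself": 120}
for token in ("alſo", "himfelf", "ſat"):
    print(token, "->", resolve(token, reference))
```

With these counts, 'alſo' and 'himfelf' resolve to 'also' and 'himself', while 'ſat' is left unchanged because neither 'sat' nor 'fat' dominates. Such cases still need context to resolve.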
Metadata
The inclusion of index terms identifying the subject matter of each entry is a distinguishing feature of the data set. Thanks to awards from the IMLS-funded LEADS-4-NDP program, we worked with two doctoral students from the Metadata Research Center (MRC) at Drexel University in the summers of 2018 and 2019, testing methods of automatically generating index terms and comparing the results. We learned that current controlled vocabularies, like the Library of Congress Subject Headings (LCSH), produced anachronistic terms for topics that did not exist in the 19th century, such as identifying a term as a computer program. At the same time, they missed significant historical concepts because those concepts relied on terminology no longer in use. All tests were run through a free-standing version of HIVE2, an online vocabulary server maintained by the MRC.
In 2019, we decided to test the hypothesis that an older controlled vocabulary—something closer in time to the encyclopedias—would create better results because of a closer match in terminology. No suitable vocabularies existed in a usable form for such an undertaking, so we created one from the first edition of the LCSH, dated 1910. By running OCR on the pages, we recreated the entire vocabulary in the SKOS format and added it to HIVE. It appears there for online use as the "1910 Library of Congress Subject Headings." Testing resulted in a slight but significant improvement in accuracy.
The final procedure included the 1910 LCSH, several narrow facets of the current LCSH, and two elements of named entity recognition. The complete process is described in Automated Metadata Procedure.
Releases
Our first public release of the data set was the TXT version of the 7th ed. in October 2022. The plain text version of the 9th ed. followed in November. An XML version of the 7th ed., including the full TEI encoding, came out in December. As of this writing, a similar version of the 9th ed. is scheduled to follow shortly, then TXT versions of the 3rd and 11th eds., and finally XML versions of both.