From Cyclopaedia to Encyclopédie: experiments in machine translation and sequence alignment

Figure 1. Title page from the 1745 prospectus of the first Encyclopédie project. This page image is taken from ARTFL’s 18th Volume of the Encyclopédie.

It is well known that the Encyclopédie ou dictionnaire raisonné des sciences, des arts et des métiers began first as a modest translation project of Ephraim Chambers’ Cyclopaedia in 1745. Over the next few years, Diderot and D’Alembert would replace the original editors and the project would be duly transformed from a simple translation into an effort to compile and organise the sum total of the world’s knowledge. Over the course of their editorial work, Diderot, and most notably D’Alembert, were not shy in incorporating these translations of the Cyclopaedia as filler for the Encyclopédie. Indeed, ‘ils ont laissé une bonne partie de ces articles presque inchangés, ou avec des modifications insignifiantes’ (Paolo Quintili, ‘D’Alembert “traduit” Chambers. Les articles de mécanique de la Cyclopædia à l’Encyclopédie’, Recherches sur Diderot et sur l’Encyclopédie 21 (1996), p.75). The philosophes were nonetheless conscious of their debt to their English predecessor Chambers. His name appears some 1154 times in the text of the Encyclopédie and he is referenced as the sole or a contributing source for 1081 articles, where his name appears in italics at the end of a section or article. Given the scale of the two works under consideration, systematic evaluation of the extent of the philosophes’ use of Chambers has remained, even today, a daunting task. John Lough, in 1980, framed the problem nicely: ‘So far no one has had the patience to make a detailed study of the exact relationship between the text of Diderot’s Encyclopédie and the work of Ephraim Chambers. This would no doubt require several years of arduous toil devoted to comparing the two works article by article’ (‘The Encyclopédie and Chambers’ Cyclopaedia’, SVEC 185 (1980), p.221).

Recent developments in machine translation and sequence alignment now offer new possibilities for the systematic comparison of digital texts across languages. The following post outlines some recent experimental work in leveraging these new techniques in an effort to reduce the ‘arduous toil’ of textual comparison, giving some preliminary examples of the kinds of results that can be achieved, and providing some cursory observations on the advantages and limitations of such systems for automatic text analysis.

Our two comparison datasets are the ARTFL Encyclopédie (v. 1117) and the recently digitised ARTFL edition of the 1741 Chambers’ Cyclopaedia (link). The 1741 edition was selected as it was one of the likely sources for the original translation project and we were able to work from high-quality page images provided by the University of Chicago Library. (On the possible editions of the Cyclopaedia used by the encyclopédistes, see Irène Passeron, ‘Quelle(s) édition(s) de la Cyclopœdia les encyclopédistes ont-ils utilisée(s)?’, Recherches sur Diderot et sur l’Encyclopédie 40-41 (2006), p.287-92.) In a nutshell, our approach was to generate a machine translation of all of the Cyclopaedia articles into French and then use ARTFL’s Text-PAIR sequence alignment system to identify similar passages between this virtual French Cyclopaedia and the Encyclopédie, with the translation providing links back to the original English edition of the Chambers as well as links to the relevant passages in the Encyclopédie.

For the English to French machine translation of Chambers, we examined two of the most widely used resources in this domain, Google Translate and DeepL. Both systems provide useful Application Programming Interfaces [APIs] as part of their respective subscription services, and both provide translations based on cutting-edge neural network language models. We compared results from various samples and found, in general, that both systems worked reasonably well, given the complications of eighteenth-century vocabularies (in both English and French) and many uncommon and archaic terms (this may be the subject of a future post). While DeepL provided somewhat more satisfying translations from a reader’s perspective, we ultimately opted to use Google Translate for the ease of its API and its ability to parse the TEI encoding of our documents with little difficulty. The latter is of critical importance, since we wanted to keep the overall document structure of our dictionaries to allow for easy navigation between the versions.

Operationally, we segmented the text of the Cyclopaedia into short blocks, split at paragraph breaks, and sent them for automatic translation via the Google API, with a short delay between blocks. This worked relatively well, though the system would occasionally throw timeout or other errors, which required a query resend. You can inspect the translation results here – though this virtual French edition of the Chambers is not really meant for public consumption. Each article has a link at the bottom to the corresponding English version for the sake of comparison. It is important to note that the objective here is NOT to produce a good translation of the text or even one that might serve as the basis for a human edition. Rather, this machine-generated edition exists as a ‘pivot-text’ between the English Chambers and the French Encyclopédie, allowing for an automatic comparison of the two (or three) versions using a highly fault-tolerant sequence aligner designed to pick out commonalities in very noisy document spaces. (See Clovis Gladstone, Russ Horton, and Mark Olsen, ‘TextPAIR (Pairwise Alignment for Intertextual Relations)’, ARTFL Project, University of Chicago, 2008-2021, and, more specifically, Mark Olsen, Russell Horton and Glenn Roe, ‘Something borrowed: sequence alignment and the identification of similar passages in large text collections’, Digital Studies / Le Champ numérique 2.1 (2011).)
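As a rough illustration of this step, the following Python sketch splits a text at paragraph breaks and sends each block for translation, pausing between blocks and resending a block when a call fails. It is not the code used for the project: the `translate` argument is a hypothetical stand-in for whatever API client is used, and the function names, delay, and retry count are assumptions made for the example.

```python
import time
from typing import Callable, Iterable, List


def split_paragraphs(text: str) -> List[str]:
    """Split a text into non-empty blocks at blank lines (paragraph breaks)."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def translate_blocks(
    blocks: Iterable[str],
    translate: Callable[[str], str],  # hypothetical wrapper around a translation API
    delay: float = 0.5,               # assumed pause between blocks, in seconds
    max_retries: int = 3,             # assumed number of attempts per block
) -> List[str]:
    """Translate blocks one at a time, resending a block if the call raises."""
    results: List[str] = []
    for block in blocks:
        for attempt in range(1, max_retries + 1):
            try:
                results.append(translate(block))
                break
            except Exception:
                # Timeout or other transient error: give up only after the last attempt.
                if attempt == max_retries:
                    raise
                time.sleep(delay * attempt)
        time.sleep(delay)  # short delay before the next block
    return results
```

Passing the translation call in as a function keeps the segmentation and resend logic independent of any particular service, so the same loop could be pointed at a different translation system for comparison.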

The next step was to establish workable parameters for the Text-PAIR alignment system. The challenge here was to find commonalities between the French translations created by eighteenth-century authors and translators and machine translations produced by a modern automatic translation system. Additionally, the editors and authors of the Encyclopédie were not necessarily constrained to produce an exact translation of the text in question, but could, and did, make significant modifications to the original in terms of length, style, and content. To address this challenge we ran a series of tests with different matching parameters such as n-gram construction (e.g., number of words that constitute an n-gram), minimum match lengths, maximum gaps between matches, and decreasing match requirements as a match length increased (what we call a ‘flex gap’), among others, on a representative selection of 100 articles from the Encyclopédie where Chambers was identified as the possible source. It is important to note that even with the best parameters, which we adjusted to get favorable recall and precision results, we were only able to identify 81 of the 100 articles. (See comparison table. The primary parameters chosen were bigrams, stemmer=true, word len=3, maxgap=12, flexmatch=true, minmatchingngrams=5. Consult the TextPair documentation and configuration file for a description of these values.) Some articles, even where clearly affiliated, were missed by the aligner, due to the size of the articles (some are very small) and fundamental differences in the translation of the English. For example, the article ‘Compulseur’ is attributed by Mallet to Chambers, but the machine translation of ‘Compulsor’ is a rather more literal and direct translation of the English article than what is offered by Mallet. Further relaxing matching parameters could potentially find this example, but would increase the number of false positives, in effect drowning out the signal with increased noise.
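To make the role of these parameters more concrete, here is a deliberately simplified Python sketch of n-gram matching. It is not Text-PAIR's algorithm: it omits stemming and the ‘flex gap’, ignores the order of n-grams in the target passage, and its reading of the parameters (minimum word length, n-gram size, maximum gap, minimum matching n-grams) is our assumption for illustration only.

```python
import re
from typing import List, Tuple


def content_words(text: str, min_word_len: int = 3) -> List[str]:
    """Lowercase the text and keep alphabetic tokens of at least min_word_len letters."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if len(t) >= min_word_len]


def ngrams(words: List[str], n: int = 2) -> List[Tuple[str, ...]]:
    """Build overlapping word n-grams (bigrams by default)."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def longest_run(source: str, target: str, n: int = 2, max_gap: int = 12) -> int:
    """Count shared n-grams in the best run of the source.

    A run is broken when more than max_gap non-matching n-gram positions
    separate two consecutive shared n-grams.
    """
    target_set = set(ngrams(content_words(target), n))
    best = run = 0
    last_hit = None
    for pos, gram in enumerate(ngrams(content_words(source), n)):
        if gram not in target_set:
            continue
        if last_hit is not None and pos - last_hit - 1 > max_gap:
            run = 0  # gap too large: start a new run
        run += 1
        best = max(best, run)
        last_hit = pos
    return best


def is_match(source: str, target: str, min_matching_ngrams: int = 5,
             n: int = 2, max_gap: int = 12) -> bool:
    """Declare a match when the best run has enough shared n-grams."""
    return longest_run(source, target, n=n, max_gap=max_gap) >= min_matching_ngrams
```

Even in this toy form the trade-off described above is visible: lowering `min_matching_ngrams` or raising `max_gap` would catch short or loosely translated articles, at the cost of accepting more chance overlaps between unrelated passages.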

All things considered, we were quite happy with the aligner’s performance given the complexity of the comparison task and the multiple potential variations between historical text and modern machine translations. To give an example of how fine-grained and at the same time highly flexible our matching parameters needed to be, see the article ‘Gynaecocracy’ below, which is a fairly direct translation on a rather specialised subject, but which nonetheless matched on only 8 content words (fig. 2).

Figure 2. Comparisons of the article ‘Gynaecocracy’.

Other straightforward articles were, however, missed due to differences in the translation and sparse matching n-grams. See, for example, the short article on ‘Occult’ lines in geometry below, where the 6 matching words weren’t enough to constitute a match for the aligner (fig. 3).

Figure 3. Comparisons of the geometry article ‘Occult’.

Obviously this is a rather inexact science, reliant on an outside process of automatic translation and the ability to match a virtual text that in reality never existed. Nonetheless the 81% recall rate we attained on our sample corpus seemed more than sufficient for this experiment and allowed us to move forward towards a more general evaluation of the entirety of identified matches.

Once settled on the optimal parameters, we then used Text-PAIR to generate both an alignment database, for interactive examination, and a set of static files. Both of these result formats are used for this project. The alignment database contains some 7304 aligned passage pairs. The system allows queries on metadata, such as author and article title, as well as on words or phrases found in the aligned passages. The system also uses faceted browsing to allow the user to summarize results by the various metadata (for more on this, see Note below). Each aligned passage is presented as a facing page representation and the user can toggle a display of all of the variations between the two aligned passages. As seen below, the variations between the texts can be extensive (fig. 4).

Figure 4. Text-PAIR interface showing differences in the article ‘Air’.

Text-PAIR also contextualises results back to the original document(s). For example, the following is the article ‘Almanach’ by D’Alembert, showing the aligned passage from Chambers in blue (fig. 5).

Figure 5. Article ‘Almanach’ with shared Chambers passages in blue.

In this instance, D’Alembert reused almost all of Chambers’ original article ‘Almanac’, with some minor variations, but does not appear to have indicated the source of the first part of his article (page image).

The alignment database is a useful first pass to examine the results of the alignment process, but it is limited in at least two ways. It identifies each aligned passage, but does not merge multiple passages identified in article pairs. Thus we find 5 separate shared passages between the two ‘Constellation’ articles. The interface also does not attempt to evaluate the alignments or identify passages that occur between different articles. For example, D’Alembert’s article ‘ATMOSPHERE’ indeed has a passage from Chambers’ article ‘Atmosphere’, but also many longer passages from the article ‘Generation’.
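The first of these limitations is essentially a grouping problem. A minimal Python sketch of how passages might be accumulated per article pair follows; the record fields (`source_head`, `target_head`, `passage_words`) are hypothetical names chosen for the example and do not reflect Text-PAIR's actual output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def merge_by_article_pair(alignments: Iterable[dict]) -> Dict[Tuple[str, str], dict]:
    """Collapse individual aligned passages into one summary per article pair."""
    merged = defaultdict(lambda: {"passages": 0, "words": 0})
    for a in alignments:
        # Hypothetical field names: headword of each article and passage length in words.
        key = (a["source_head"], a["target_head"])
        merged[key]["passages"] += 1
        merged[key]["words"] += a["passage_words"]
    return dict(merged)
```

Grouping on both headwords at once also surfaces cross-article cases, since a passage shared between differently titled articles simply appears under its own key.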

To accumulate results and to refine evaluation, we subsequently processed the raw Text-PAIR alignment data as found in the static output files. We developed an evaluation algorithm for each alignment, with parameters based on the length of the matching passages and the degree to which the headwords were close matches. This simple evaluation model eliminated a significant number of false positives, which we found were typically short text matches between articles with different headwords. The output of this algorithm resulted in two tables, one for matches that were likely to be valid and one that was less likely to be valid, based on our simple heuristics – see a selection of the ‘YES’ table below (fig. 6). We are, of course, making this distinction based on the comparison of the machine translated Chambers headwords and the headwords found in the Encyclopédie, so we expected that some valid matches would be identified as invalid.
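A heuristic of this kind can be sketched in a few lines of Python. The version below is an assumption-laden illustration, not the evaluation algorithm itself: the thresholds, the field names, and the use of `difflib.SequenceMatcher` as the headword similarity measure are all choices made for the example.

```python
from difflib import SequenceMatcher


def headword_similarity(a: str, b: str) -> float:
    """Character-level similarity between two headwords, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def classify(alignment: dict,
             min_similarity: float = 0.8,        # assumed headword threshold
             min_words_if_different: int = 50    # assumed length threshold
             ) -> str:
    """Label an alignment 'YES' (likely valid) or 'NO' (less likely valid).

    Assumed rule: accept when the headwords are close, or when the shared
    passage is long enough to be trusted despite differing headwords.
    """
    similar = headword_similarity(
        alignment["source_head"], alignment["target_head"]
    ) >= min_similarity
    long_enough = alignment["passage_words"] >= min_words_if_different
    return "YES" if similar or long_enough else "NO"
```

A rule of this shape discards short matches between differently titled articles, and for the same reason it will wrongly reject valid matches whose headwords were translated differently.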

Figure 6. Table of possible article borrowings.

The next phase of the project included the necessary step of human evaluation of the identified matches. While we were able to reduce the work involved significantly by generating a list of reasonably solid matches to be inspected, there is still no way to eliminate fully the ‘arduous toil’ of comparison referenced by Lough. More than 5000 potential matches were scrutinised, looking in essence for ‘false negatives’, i.e., matches that our evaluation algorithm classed as negative (based primarily on differences in headword translations) but that were in reality valid. The results of this work were then merged into a single table of what we consider to be valid matches, a list that includes some 3700 Encyclopédie articles with at least one matching passage from the Cyclopaedia. These results will form the basis of a longer article that is currently in preparation.

Conclusions

In all, we found some 3778 articles in the Encyclopédie that upon evaluation seem highly similar in both content and structure to articles in the 1741 edition of Chambers’ Cyclopaedia. Whether or not these articles constitute real acts of historical translation is the subject for another, or several other, articles. There are simply too many outside factors at play, even in this rather straightforward comparison, to make blanket conclusions about the editorial practices of the encyclopédistes based on this limited experiment. What we can say, however, is that of the 1081 articles that include a ‘Chambers’ reference in the Encyclopédie, we only found 689 with at least one matching passage. Obviously this recall rate of 63.7% is well below the 81% we attained on our sample corpus, probably due to overfitting the matching algorithm to the sample, which warrants further investigation. But beyond testing this ground truth, we are also left with the rather astounding fact of 3089 articles with no reference to Chambers whatsoever, all of which seem, at first blush, to be at least somewhat related to their English predecessors.
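For readers who wish to check the figures, the arithmetic behind these numbers can be reproduced in a few lines of Python. The variable names are ours; the counts are those reported in this post.

```python
# Counts reported in the post.
referenced_articles = 1081      # Encyclopédie articles with a 'Chambers' reference
referenced_with_match = 689     # of those, articles with at least one matching passage
all_matched_articles = 3778     # all Encyclopédie articles judged to be valid matches
sample_found, sample_size = 81, 100

# Recall against the 'Chambers' references, and on the tuning sample.
recall_full = referenced_with_match / referenced_articles
recall_sample = sample_found / sample_size

# Matched articles that carry no reference to Chambers.
unreferenced_matches = all_matched_articles - referenced_with_match

print(f"recall on referenced articles: {recall_full:.1%}")   # 63.7%
print(f"recall on sample: {recall_sample:.0%}")              # 81%
print(f"matches without a Chambers reference: {unreferenced_matches}")  # 3089
```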

The overall evaluation of these results remains ongoing, and the ‘arduous toil’ of traditional textual comparison continues apace, albeit guided somewhat by the machine’s heavy hand. Indeed, the use of machine translation as a bridge between documents to find similar passages, be they reuses, plagiarisms, etc., is, as we have attempted to show here, a workable approach for future research, although not without certain limitations. The Chambers–Encyclopédie task outlined above is fairly well constrained and historically bounded. More general applications of these same methods may well yield less useful results. These reservations notwithstanding, the fact that we were able to unearth many thousands of valid potential intertextual relationships between documents in different languages is a feat that even a few years ago might not have been possible. As large-scale language models become ever more sophisticated and historically aware, the dream of intertextual bridges between multilingual corpora may yet become a reality. (For more on ‘intertextual bridges’ in French, see our current NEH project.)

Note

The question of the Dictionnaire de Trévoux is one such factor, as it is known that both Chambers and the encyclopédistes used it as a source for their own articles – so matches we find between the Chambers and Encyclopédie may indeed represent shared borrowings from the Trévoux and not a translation at all. Or, more interestingly, perhaps Chambers translated a Trévoux article from French to English, which a dutiful encyclopédiste then translated back to French for the Encyclopédie – in this case, which article is the ‘source’ and which the ‘translation’? For more on these particular aspects of dictionary-making, see our previous article ‘Plundering philosophers: identifying sources of the Encyclopédie’, Journal of the Association for History and Computing 13.1 (Spring 2010) and Marie Leca-Tsiomis’ response, ‘The use and abuse of the digital humanities in the history of ideas: how to study the Encyclopédie’, History of European ideas 39.4 (2013), p.467-76.

– Glenn Roe and Mark Olsen
