From Cyclopaedia to Encyclopédie: experiments in machine translation and sequence alignment

Figure 1. Title page from the 1745 prospectus of the first Encyclopédie project. This page image is taken from ARTFL’s 18th Volume of the Encyclopédie.

It is well known that the Encyclopédie ou dictionnaire raisonné des sciences, des arts et des métiers began first as a modest translation project of Ephraim Chambers’ Cyclopaedia in 1745. Over the next few years, Diderot and D’Alembert would replace the original editors and the project would be duly transformed from a simple translation into an effort to compile and organise the sum total of the world’s knowledge. Over the course of their editorial work, Diderot, and most notably D’Alembert, were not shy in incorporating these translations of the Cyclopaedia as filler for the Encyclopédie. Indeed, ‘ils ont laissé une bonne partie de ces articles presque inchangés, ou avec des modifications insignifiantes’ (Paolo Quintili, ‘D’Alembert “traduit” Chambers. Les articles de mécanique de la Cyclopædia à l’Encyclopédie’, Recherches sur Diderot et sur l’Encyclopédie 21 (1996), p.75). The philosophes were nonetheless conscious of their debt to their English predecessor Chambers. His name appears some 1154 times in the text of the Encyclopédie and he is referenced as sole or contributing source to 1081 articles, where his name appears in italics at the end of a section or article. Given the scale of the two works under consideration, systematic evaluation of the extent of the philosophes’ use of Chambers has remained, even today, a daunting task. John Lough, in 1980, framed the problem nicely: ‘So far no one has had the patience to make a detailed study of the exact relationship between the text of Diderot’s Encyclopédie and the work of Ephraim Chambers. This would no doubt require several years of arduous toil devoted to comparing the two works article by article’ (‘The Encyclopédie and Chambers’ Cyclopaedia’, SVEC 185 (1980), p.221).

Recent developments in machine translation and sequence alignment now offer new possibilities for the systematic comparison of digital texts across languages. The following post outlines some recent experimental work in leveraging these new techniques in an effort to reduce the ‘arduous toil’ of textual comparison, giving some preliminary examples of the kinds of results that can be achieved, and providing some cursory observations on the advantages and limitations of such systems for automatic text analysis.

Our two comparison datasets are the ARTFL Encyclopédie (v. 1117) and the recently digitised ARTFL edition of the 1741 Chambers’ Cyclopaedia (link). The 1741 edition was selected as it was one of the likely sources for the original translation project and we were able to work from high-quality page images provided by the University of Chicago Library. (On the possible editions of the Cyclopaedia used by the encyclopédistes, see Irène Passeron, ‘Quelle(s) édition(s) de la Cyclopœdia les encyclopédistes ont-ils utilisée(s)?’, Recherches sur Diderot et sur l’Encyclopédie 40-41 (2006), p.287-92.) In a nutshell, our approach was to generate a machine translation of all of the Cyclopaedia articles into French and then use ARTFL’s Text-PAIR sequence alignment system to identify similar passages between this virtual French Cyclopaedia and the Encyclopédie, with the translation providing links back to the original English edition of the Chambers as well as links to the relevant passages in the Encyclopédie.

For the English to French machine translation of Chambers, we examined two of the most widely used resources in this domain, Google Translate and DeepL. Both systems provide useful Application Programming Interfaces [APIs] as part of their respective subscription services, and both provide translations based on cutting-edge neural network language models. We compared results from various samples and found, in general, that both systems worked reasonably well, given the complications of eighteenth-century vocabularies (in both English and French) and many uncommon and archaic terms (this may be the subject of a future post). While DeepL provided somewhat more satisfying translations from a reader’s perspective, we ultimately opted to use Google Translate for the ease of its API and its ability to parse the TEI encoding of our documents with little difficulty. The latter is of critical importance, since we wanted to keep the overall document structure of our dictionaries to allow for easy navigation between the versions.

Operationally, we segmented the text of the Cyclopaedia into short blocks, split at paragraph breaks, and sent them for automatic translation via the Google API, with a short delay between blocks. This worked relatively well, though the system would occasionally throw timeout or other errors, which required a query resend. You can inspect the translation results here – though this virtual French edition of the Chambers is not really meant for public consumption. Each article has a link at the bottom to the corresponding English version for the sake of comparison. It is important to note that the objective here is NOT to produce a good translation of the text or even one that might serve as the basis for a human edition. Rather, this machine-generated edition exists as a ‘pivot-text’ between the English Chambers and the French Encyclopédie, allowing for an automatic comparison of the two (or three) versions using a highly fault-tolerant sequence aligner designed to pick out commonalities in very noisy document spaces. (See Clovis Gladstone, Russ Horton, and Mark Olsen, ‘TextPAIR (Pairwise Alignment for Intertextual Relations)’, ARTFL Project, University of Chicago, 2008-2021, and, more specifically, Mark Olsen, Russell Horton and Glenn Roe, ‘Something borrowed: sequence alignment and the identification of similar passages in large text collections’, Digital Studies / Le Champ numérique 2.1 (2011).)
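The segmentation and resend logic described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the project: the `translate` argument is a hypothetical stand-in for whatever function wraps the translation service, and the function names, retry count and delay are all assumptions.

```python
import time
from typing import Callable, List


def split_into_blocks(article_text: str) -> List[str]:
    """Split an article at paragraph breaks (blank lines), dropping empty blocks."""
    return [block.strip() for block in article_text.split("\n\n") if block.strip()]


def translate_with_retry(
    block: str,
    translate: Callable[[str], str],
    max_attempts: int = 3,
    delay_seconds: float = 1.0,
) -> str:
    """Send one block for translation, resending the query if an error is raised."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return translate(block)
        except Exception as error:  # e.g. a timeout reported by the service
            last_error = error
            if attempt < max_attempts - 1:
                time.sleep(delay_seconds)  # wait before resending
    raise RuntimeError(f"translation failed after {max_attempts} attempts") from last_error


def translate_article(
    article_text: str,
    translate: Callable[[str], str],
    delay_seconds: float = 1.0,
) -> str:
    """Translate an article block by block, pausing briefly between blocks."""
    translated = []
    for block in split_into_blocks(article_text):
        translated.append(translate_with_retry(block, translate, delay_seconds=delay_seconds))
        time.sleep(delay_seconds)  # short delay between blocks
    return "\n\n".join(translated)
```

Keeping the paragraph boundaries intact on the way out is what allows each translated block to be lined up again with its English original.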

The next step was to establish workable parameters for the Text-PAIR alignment system. The challenge here was to find commonalities between the French translations created by eighteenth-century authors and translators and machine translations produced by a modern automatic translation system. Additionally, the editors and authors of the Encyclopédie were not necessarily constrained to produce an exact translation of the text in question, but could, and did, make significant modifications to the original in terms of length, style, and content. To address this challenge we ran a series of tests with different matching parameters such as n-gram construction (e.g., number of words that constitute an n-gram), minimum match lengths, maximum gaps between matches, and decreasing match requirements as a match length increased (what we call a ‘flex gap’), among others, on a representative selection of 100 articles from the Encyclopédie where Chambers was identified as the possible source. It is important to note that even with the best parameters, which we adjusted to get favorable recall and precision results, we were only able to identify 81 of the 100 articles. (See comparison table. The primary parameters chosen were bigrams, stemmer=true, word len=3, maxgap=12, flexmatch=true, minmatchingngrams=5. Consult the TextPair documentation and configuration file for a description of these values.) Some articles, even where clearly affiliated, were missed by the aligner, due to the size of the articles (some are very small) and fundamental differences in the translation of the English. For example, the article ‘Compulseur’ is attributed by Mallet to Chambers, but the machine translation of ‘Compulsor’ is a rather more literal and direct translation of the English article than what is offered by Mallet. Further relaxing matching parameters could potentially find this example, but would increase the number of false positives, in effect drowning out the signal with increased noise.
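To give a rough feel for what parameters such as n-gram size, minimum word length and a minimum number of matching n-grams control, here is a deliberately simplified sketch. It is not Text-PAIR's implementation: it omits stemming, gap handling and the ‘flex gap’ entirely, it treats the word-length setting as a minimum number of characters (an assumption), and all function names are hypothetical.

```python
import re
from typing import List, Tuple


def make_ngrams(text: str, n: int = 2, min_word_length: int = 3) -> List[Tuple[str, ...]]:
    """Lower-case the text, drop very short words, and build word n-grams."""
    words = [w for w in re.findall(r"\w+", text.lower()) if len(w) >= min_word_length]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_shared_ngrams(source: str, target: str, n: int = 2, min_word_length: int = 3) -> int:
    """Count the distinct n-grams that occur in both passages."""
    source_ngrams = set(make_ngrams(source, n, min_word_length))
    target_ngrams = set(make_ngrams(target, n, min_word_length))
    return len(source_ngrams & target_ngrams)


def is_candidate_match(
    source: str,
    target: str,
    min_matching_ngrams: int = 5,
    n: int = 2,
    min_word_length: int = 3,
) -> bool:
    """Flag a pair of passages when they share enough n-grams."""
    return count_shared_ngrams(source, target, n, min_word_length) >= min_matching_ngrams
```

Even in this toy version the trade-off is visible: lowering `min_matching_ngrams` lets short articles through, but also admits pairs that merely share a few common phrases.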

All things considered, we were quite happy with the aligner’s performance given the complexity of the comparison task and the multiple potential variations between historical text and modern machine translations. To give an example of how fine-grained and at the same time highly flexible our matching parameters needed to be, see the article ‘Gynaecocracy’ below, which is a fairly direct translation on a rather specialised subject, but which nonetheless matched on only 8 content words (fig. 2).

Figure 2. Comparisons of the article ‘Gynaecocracy’.

Other straightforward articles were, however, missed due to differences in the translation and sparse matching n-grams. See, for example, the small article on ‘Occult’ lines in geometry below, where the 6 matching words weren’t enough to constitute a match for the aligner (fig. 3).

Figure 3. Comparisons of the geometry article ‘Occult’.

Obviously this is a rather inexact science, reliant on an outside process of automatic translation and the ability to match a virtual text that in reality never existed. Nonetheless the 81% recall rate we attained on our sample corpus seemed more than sufficient for this experiment and allowed us to move forward towards a more general evaluation of the entirety of identified matches.

Once we had settled on the optimal parameters, we then used Text-PAIR to generate both an alignment database, for interactive examination, and a set of static files. Both of these result formats are used for this project. The alignment database contains some 7304 aligned passage pairs. The system allows queries on metadata, such as author and article title, as well as words or phrases found in the aligned passages. The system also uses faceted browsing to allow the user to summarize results by the various metadata (for more on this, see Note below). Each aligned passage is presented as a facing page representation and the user can toggle a display of all of the variations between the two aligned passages. As seen below, the variations between the texts can be extensive (fig. 4).
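For readers curious about how variations between two aligned passages might be listed, the following is a small sketch using Python's standard `difflib` module. It only illustrates the general idea of a word-level comparison; it is not how the Text-PAIR interface computes its display, and the function name is hypothetical.

```python
import difflib
from typing import List, Tuple


def passage_variations(source: str, target: str) -> List[Tuple[str, str, str]]:
    """Return (operation, source words, target words) for every non-identical span."""
    source_words = source.split()
    target_words = target.split()
    matcher = difflib.SequenceMatcher(a=source_words, b=target_words, autojunk=False)
    variations = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replacements, insertions and deletions
            variations.append(
                (tag, " ".join(source_words[i1:i2]), " ".join(target_words[j1:j2]))
            )
    return variations
```

Calling `passage_variations` on two versions of the same sentence returns only the spans that differ, which is the information a facing-page display would need in order to highlight them.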

Figure 4. Text-PAIR interface showing differences in the article ‘Air’.

Text-PAIR also contextualises results back to the original document(s). For example, the following is the article ‘Almanach’ by D’Alembert, showing the aligned passage from Chambers in blue (fig. 5).

Figure 5. Article ‘Almanach’ with shared Chambers passages in blue.

In this instance, D’Alembert reused almost all of Chambers’ original article ‘Almanac’, with some minor variations, but does not appear to have indicated the source of the first part of his article (page image).

The alignment database is a useful first pass to examine the results of the alignment process, but it is limited in at least two ways. It identifies each aligned passage, but does not merge multiple passages identified in article pairs. Thus we find 5 shared passages between the two ‘Constellation’ articles. The interface also does not attempt to evaluate the alignments or identify passages that occur between different articles. For example, D’Alembert’s article ‘ATMOSPHERE’ indeed has a passage from Chambers’ article ‘Atmosphere’, but also many longer passages from the article ‘Generation’.

To accumulate results and to refine evaluation, we subsequently processed the raw Text-PAIR alignment data as found in the static output files. We developed an evaluation algorithm for each alignment, with parameters based on the length of the matching passages and the degree to which the headwords were close matches. This simple evaluation model eliminated a significant number of false positives, which we found were typically short text matches between articles with different headwords. The output of this algorithm resulted in two tables, one for matches that were likely to be valid and one that was less likely to be valid, based on our simple heuristics – see a selection of the ‘YES’ table below (fig. 6). We are, of course, making this distinction based on the comparison of the machine translated Chambers headwords and the headwords found in the Encyclopédie, so we expected that some valid matches would be identified as invalid.
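A heuristic of this general kind could look something like the sketch below. The thresholds, the use of `difflib` to score headword closeness, and the rule that long passages are kept even when headwords differ are all assumptions made for illustration; they are not the parameters of the evaluation algorithm described above.

```python
import difflib
import unicodedata


def normalise_headword(headword: str) -> str:
    """Lower-case a headword and strip accents so that 'Atmosphère' ~ 'ATMOSPHERE'."""
    decomposed = unicodedata.normalize("NFD", headword.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c)).strip()


def headword_similarity(source_headword: str, target_headword: str) -> float:
    """Similarity between 0 and 1 based on shared character sequences."""
    return difflib.SequenceMatcher(
        a=normalise_headword(source_headword),
        b=normalise_headword(target_headword),
    ).ratio()


def classify_alignment(
    source_headword: str,
    target_headword: str,
    passage_words: int,
    min_similarity: float = 0.8,    # hypothetical threshold
    long_passage_words: int = 50,   # hypothetical threshold
) -> str:
    """Label an alignment 'YES' (likely valid) or 'NO' (less likely valid)."""
    if headword_similarity(source_headword, target_headword) >= min_similarity:
        return "YES"
    if passage_words >= long_passage_words:
        return "YES"  # long shared passages are kept even across different headwords
    return "NO"       # short match between articles with different headwords
```

Because the headwords being compared are themselves machine translations, a rule like this will inevitably send some valid matches to the ‘NO’ table, which is why a later human pass over the rejected pairs remains necessary.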

Figure 6. Table of possible article borrowings.

The next phase of the project included the necessary step of human evaluation of the identified matches. While we were able to reduce the work involved significantly by generating a list of reasonably solid matches to be inspected, there is still no way to eliminate fully the ‘arduous toil’ of comparison referenced by Lough. More than 5000 potential matches were scrutinised, looking in essence for ‘false negatives’, i.e., matches that our evaluation algorithm classed as negative (based primarily on differences in headword translations) but that were in reality valid. The results of this work were then merged into a single table of what we consider to be valid matches, a list that includes some 3700 Encyclopédie articles with at least one matching passage from the Cyclopaedia. These results will form the basis of a longer article that is currently in preparation.

Conclusions

In all, we found some 3778 articles in the Encyclopédie that upon evaluation seem highly similar in both content and structure to articles in the 1741 edition of Chambers’ Cyclopaedia. Whether or not these articles constitute real acts of historical translation is the subject for another, or several other, articles. There are simply too many outside factors at play, even in this rather straightforward comparison, to make blanket conclusions about the editorial practices of the encyclopédistes based on this limited experiment. What we can say, however, is that of the 1081 articles that include a ‘Chambers’ reference in the Encyclopédie, we only found 689 with at least one matching passage. Obviously this recall rate of 63.7% is well below the 81% we attained on our sample corpus, probably due to overfitting the matching algorithm to the sample, which warrants further investigation. But beyond testing this ground truth, we are also left with the rather astounding fact of 3089 articles with no reference to Chambers whatsoever, all of which seem, at first blush, to be at least somewhat related to their English predecessors.
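The two figures quoted above follow directly from the counts given in the text, as this short check shows (the variable names are ours):

```python
# Counts taken from the paragraph above
articles_citing_chambers = 1081   # Encyclopédie articles with a 'Chambers' reference
cited_with_match = 689            # of those, articles with at least one matching passage
all_matched_articles = 3778       # all Encyclopédie articles judged highly similar

# Recall against the 'ground truth' of explicit Chambers references
recall = cited_with_match / articles_citing_chambers

# Matched articles that carry no reference to Chambers
unattributed_matches = all_matched_articles - cited_with_match

print(f"recall: {recall:.1%}")                          # 63.7%
print(f"unattributed matches: {unattributed_matches}")  # 3089
```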

The overall evaluation of these results remains ongoing, and the ‘arduous toil’ of traditional textual comparison continues apace, albeit guided somewhat by the machine’s heavy hand. Indeed, the use of machine translation as a bridge between documents to find similar passages, be they reuses, plagiarisms, etc., is, as we have attempted to show here, a workable approach for future research, although not without certain limitations. The Chambers–Encyclopédie task outlined above is fairly well constrained and historically bounded. More general applications of these same methods may well yield less useful results. These reservations notwithstanding, the fact that we were able to unearth many thousands of valid potential intertextual relationships between documents in different languages is a feat that even a few years ago might not have been possible. As large-scale language models become ever more sophisticated and historically aware, the dream of intertextual bridges between multilingual corpora may yet become a reality. (For more on ‘intertextual bridges’ in French, see our current NEH project.)

Note

The question of the Dictionnaire de Trévoux is one such factor, as it is known that both Chambers and the encyclopédistes used it as a source for their own articles – so matches we find between the Chambers and Encyclopédie may indeed represent shared borrowings from the Trévoux and not a translation at all. Or, more interestingly, perhaps Chambers translated a Trévoux article from French to English, which a dutiful encyclopédiste then translated back to French for the Encyclopédie – in this case, which article is the ‘source’ and which the ‘translation’? For more on these particular aspects of dictionary-making, see our previous article ‘Plundering philosophers: identifying sources of the Encyclopédie’, Journal of the Association for History and Computing 13.1 (Spring 2010) and Marie Leca-Tsiomis’ response, ‘The use and abuse of the digital humanities in the history of ideas: how to study the Encyclopédie’, History of European ideas 39.4 (2013), p.467-76.

– Glenn Roe and Mark Olsen

From the mundane to the philosophical: topic-modelling Voltaire and Rousseau’s correspondence

Voltaire’s and Rousseau’s correspondences are two fascinating collections which have perhaps not received as much attention as they could have due to the nature of these texts. Written over five decades, these letters cover a wide range of topics, from the mundanity of everyday concerns to more elaborate subjects. Getting an overall picture of these correspondences is challenging for the simple reader. This is unfortunate since these correspondences not only constitute a window into the private lives of Voltaire and Rousseau, or show an unfiltered expression of their respective thoughts, but they are also an example of the eclecticism professed by the philosophes. Fortunately modern computational techniques can truly help in providing an overview of the content of these letters and hopefully recapture – in a somewhat organized fashion – this very eclecticism of the Lumières. Thanks to the collaboration between the Voltaire Foundation and the ARTFL Project, I will be briefly discussing how topic-modeling can be used to draw an overall picture of these correspondences, and show a couple of examples of the model built from the Voltaire letters.

The ARTFL Project has long been engaged in exploring 18th-century discourses using digital tools, and the thematic opacity of correspondences is an ideal use-case for topic-modelling. This particular algorithm was designed to generate clusters of closely related words (or topics) by analyzing all word co-occurrences in any given corpus. Because these topics are extracted from their source texts, they are understood to describe the contents of the corpus analyzed. We recently released a topic-modelling browser – called TopoLogic – which was designed to explore such clusters of co-occurring words, and ran a preliminary experiment against the French Revolutionary Collection, the results of which can be seen here. When we built the topic models for Voltaire and Rousseau’s correspondences, we made sure to use the same parameters for both collections such that 40 topics (or discourses) were generated from each set of letters. We also only used those letters written by Voltaire on one side, and Rousseau on the other, hoping that we could perhaps make some comparisons between both models.

Let’s start with the Voltaire model, from which you can see the first 20 topics below:

As a first view into the topic model, the browser gives us the top 10 words for each topic, as well as their overall prevalence in the letters by Voltaire. From there we can further explore any topic, such as 16, which seems to map to Voltaire’s idea of the philosophe fighting against religious intolerance. By clicking on the topic however, we get an overview of how the topic is distributed in time, most important words in the topic, correlated topics, as well as documents where the topic is prominent (see figure below).

Let’s focus on several sections of this overview. We note below that the terms of philosophe and philosophie are weighted far more heavily than any other term, suggesting perhaps that all other words in this cluster may just constitute different characteristics of the philosophe in Voltaire’s eyes: religious concerns (prêtre, jésuite, religion, tolérance), attributes (honnête, sage), means of expression (article, livre).

All of these observations can of course be verified by exploring letters that feature topic 16 in a prominent way, which the browser does list. We can also see how the philosophe discourse evolves over the more than sixty years of Voltaire’s letters. Unsurprisingly, as his public involvement in religious affairs increases, the prevalence of such terms discussing his idea of the philosophe rises as well in his letters.

Among the discourses which tend to follow the same trend over time (see figure below), the cluster of terms related to justice (topic 5) stands out, once again showing that his public involvement is mirrored in his private correspondence. While these aspects are nothing really new, they provide for the prospective reader an easy way to find those letters that do discuss these topics.

Another interesting aspect of topic-modeling is that we can also examine the discursive make-up of any of Voltaire’s letters, and see if there are any other letters that share the same themes. Let’s examine Voltaire’s famous letter to Rousseau in which he mocks the citoyen de Genève’s position on the impact of literature in the second discourse (see figure below): ‘Les Lettres nourissent l’âme, la rectifient, la consolent’.

When we look at the topical representation of this letter in the browser, we can note that the model found a number of different topics within this letter, which when combined do provide an overview of its contents. In it, Voltaire discusses – with much irony – his own experience as a writer (topic 33), which includes his role as historiographe du roi (topic 36), as well as the many controversies he was involved in (topic 10). He sarcastically laments the fact that he cannot afford to live with savages in a distant land (topic 25) because his health requires him to be treated by a doctor (topics 26 and 35). And as a whole, he defends the role of literature as a positive good for man (topic 0). Of course, one could argue that this topical structure is approximate and open to discussion, and this is certainly true. However, this approximation is now available for all 15,000 letters, which then allows the computer to compare and group letters by this very topical structure. In this same document view, we can see documents which share a similar mixture of topics, such as a letter to Ivan Shuvalov from 1757 where Voltaire discusses his writing of history while displaying a very keen concern for the perception and impact of his writing, or another to D’Alembert where he complains about his bad health while stressing the importance of writing about useful things (‘il y avait cent choses utiles à dire qu’on n’a point dittes encore’).
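One common way of comparing letters by their mixture of topics is to treat each letter as a vector of topic weights and rank the other letters by cosine similarity. The sketch below illustrates that idea only; the similarity measure actually used by the browser is not described here, so the choice of cosine similarity, the function names and the toy data are all assumptions.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two topic-weight vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_letters(
    query_id: str,
    topic_weights: Dict[str, List[float]],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Rank all other letters by how closely their topic mixture resembles the query's."""
    query = topic_weights[query_id]
    scored = [
        (letter_id, cosine_similarity(query, weights))
        for letter_id, weights in topic_weights.items()
        if letter_id != query_id
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Toy example with three letters and three topics (invented weights)
letters = {
    "letter_a": [0.7, 0.2, 0.1],
    "letter_b": [0.6, 0.3, 0.1],
    "letter_c": [0.0, 0.1, 0.9],
}
print(most_similar_letters("letter_a", letters, top_n=1))
```

In the toy data, `letter_b` is ranked closest to `letter_a` because both are dominated by the first topic, whereas `letter_c` is almost entirely about the third.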

One last aspect of the topic model is to examine the individual uses of words and the different contexts in which they are used. If we look at the uses of écrivain in the correspondences (see figure below), we can see that its uses span different types of discourses related to reason, the writing of history, or the public role of the writer. Looking at the actual word associations, we also note potentially interesting patterns. In the case of words that share similar topic distributions (used with a similar mix of discourses), a group of terms related to ignorance seems to dominate: fausseté, mensonge, ignorance, vérité, erreur, fable… This may allude to a sense of mission in Voltaire’s writings: to correct inaccuracies, to dispel lies, to reestablish the truth in the face of ignorance. Looking this time at words that tend to co-occur with écrivain, we get a very different picture, with terms that relate more to the activity of writing and the product of that writing. These two views on word associations do not contradict one another, but suggest different ways of thinking of the role of the écrivain as depicted in Voltaire’s letters.

To finish, let’s take a look at the topic model of Rousseau’s correspondence, and in particular how we can relate it to that of Voltaire. A quick overview of the first 20 topics in Rousseau’s letters reveals a similar – yet distinct – picture of the topical composition of his correspondence (see figure below).

Using the browser, we could track down Rousseau’s response to Voltaire’s criticism of the second discourse, and see if other letters discuss similar themes. This is all within the scope of this browser. For the sake of brevity however, and to show how topic models can be used to run comparative experiments, we wanted to focus on Rousseau’s usage of the word écrivain in order to see if and how it differed from what was suggested in the Voltaire model. As we can see below, Rousseau tends to use the term in similar contexts: the écrivain is invoked first and foremost as a conveyor of truth. But looking more closely at word associations, a distinctive pattern does emerge: such terms as lâche, haine, hypocrite, acharnement, or jalousie highlight a well-known trait of Rousseau, his paranoia in the face of his success as a writer. Clicking on any of these words in the browser would allow a researcher to track down the individual uses of these terms as they relate to écrivain, and find those letters that discuss his persecution complex.

To conclude, we are well aware that any analysis provided here is built purely on the patterns derived from the topic models and, as such, remains unproven until verified by a close reading of the letters themselves. However, we hope to have shown how using a tool such as topic modeling can potentially provide new insights into the correspondences of Voltaire and Rousseau, or at the very least offer better guidance to scholars working on these two incredibly rich collections.

Clovis Gladstone

This article was first published in the Café Lumières blog in June 2020.

Clovis Gladstone’s Rousseau et le matérialisme appeared in Oxford University Studies in the Enlightenment 2020:8.

The Newberry French Revolution Collection at ARTFL

As we begin planning Digitizing Enlightenment IV, which will take place in the context of the ISECS Congress in Edinburgh in July 2019, we are keen to broaden the scope and breadth of the Digitizing Enlightenment community in order to highlight new, and existing, digital projects across the interdisciplinary spectrum of eighteenth-century studies. This post, based on work presented at the Digitizing Enlightenment III workshop held in Oxford in July 2018, demonstrates how to identify text reuse – citations, borrowings, plagiarisms – as well as other techniques for leveraging freely available large data-sets from the 18C.
– Glenn Roe, Voltaire Lab

The incredible richness of the Newberry Library’s French Revolution Collection (FRC) has been long known. It consists of more than 30,000 pamphlets and more than 23,000 issues of 180 periodicals published between 1780 and 1810, representing the opinions of all the factions that opposed and defended the monarchy during the turbulent period between 1789 and 1799, and also contains innumerable ephemeral publications of the early First Republic. The Newberry has released digital copies of more than 35,000 pamphlets totalling approximately 850,000 pages. Not only has the Newberry made the collection available to the public, but it has released a data feed of the entire collection, consisting of the Library’s exceptional metadata describing each object, the OCR text data, and links to the digital facsimiles accessible from the Internet Archive, encouraging researchers and instructors to incorporate the digital collection in new kinds of scholarship and engagement.

In order to facilitate experimental work at ARTFL on this unparalleled resource, we have loaded two versions of this collection – based on a download of the collection from the Newberry’s GitHub repository in November 2017 – into PhiloLogic4, the latest release of ARTFL’s text analysis software. The full version contains all 38,377 documents dating from the 16th century to the end of the 19th century. Our second build attempts to eliminate duplicate documents, is restricted to the period 1787-1799, and thus contains 26,445 documents. Additional implementation information and full open access to both versions of the FRC collection are available online. The quality and coverage of the FRC texts make it an ideal environment to test a variety of experiments and algorithms to enhance access and open new kinds of approaches using the 1787-99 sample data. At the bottom of the ARTFL FRC page, we have provided links to several different models for examining the collection which are based on extensions to the PhiloLogic4 package.

The simplest model is a document level search which returns matching documents by relevancy ranking based on Python Whoosh. This functions somewhat like a Google search on the collection, with links to the page images of the document or specific instances of the search words in context. For example, the results of a search for “conspirateurs aristocrates ennemis étrangères royalistes” can be seen here.
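To illustrate what relevancy ranking means in practice, here is a toy ranking function written with the Python standard library. It is a simplified TF-IDF stand-in and makes no attempt to reproduce Whoosh's own scoring; the function names and the three sample documents are invented for the example.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def rank_documents(query: str, documents: Dict[str, str]) -> List[Tuple[str, float]]:
    """Score each document by summed term frequency x inverse document frequency."""
    doc_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)
    scores: Dict[str, float] = {}
    for term in set(tokenize(query)):
        doc_freq = sum(1 for counts in doc_counts.values() if term in counts)
        if doc_freq == 0:
            continue  # the term appears nowhere in the collection
        idf = math.log(1 + n_docs / doc_freq)  # rarer terms weigh more
        for doc_id, counts in doc_counts.items():
            if counts[term]:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Invented sample documents
pamphlets = {
    "p1": "les conspirateurs et les aristocrates",
    "p2": "les royalistes",
    "p3": "la récolte de blé",
}
print(rank_documents("conspirateurs aristocrates royalistes", pamphlets))
```

Documents that contain more of the query words, and rarer ones, float to the top, while documents containing none of them are not returned at all.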

The second approach is the application of a Topic Model algorithm to the collection. Topic Models, which are widely used in digital humanities, are a set of unsupervised learning algorithms that divide a collection into a specified number of clusters based on the vocabulary of each document. The results of the Topic Model have been added to the metadata of the PhiloLogic4 build of the 1787-99 sample data. Each document is identified as having a first and second topic, denoted as A or B, with a number from 00-49 as listed in this TABLE. The first column is the topic number; the second is one or more English keywords, which can also be searched. The third column is the top 3 weighted words (features) of that topic, and the fourth column is the rest of the top 10, all of which are shown in relative weight order. Thus, A29 will return the documents that have money assignats as the top weighted topic. Searching for “money” in topic models will return documents with this as either the first or second topic. An alternative use of this data is to copy some or all of the terms in columns 3 and 4 into the Whoosh search form and get the documents in ranked relevancy order.
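As an illustration of the labelling scheme, the following sketch derives ‘A’ and ‘B’ labels from a document's topic weights. It assumes that each document comes with one weight per topic and that the two highest weights define the first and second topic; the function name is hypothetical and the actual build process may differ.

```python
from typing import List, Tuple


def top_two_topic_labels(weights: List[float]) -> Tuple[str, str]:
    """Return labels such as ('A29', 'B07') for the two most heavily weighted topics."""
    if len(weights) < 2:
        raise ValueError("at least two topic weights are required")
    # Topic indices ordered from the highest weight to the lowest
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return f"A{ranked[0]:02d}", f"B{ranked[1]:02d}"


# A made-up document dominated by topic 29, with topic 7 in second place
weights = [0.0] * 50
weights[29] = 0.6
weights[7] = 0.3
print(top_two_topic_labels(weights))
```

Zero-padding the topic number keeps the labels a fixed width, so that a metadata search for `A07` cannot be confused with one for `A7x`.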

Our first presentation of our work at Digitizing Enlightenment III showed results from applying the latest version of our sequence aligner to detect text reuse – citations, borrowings, plagiarisms, and so on – from pre-Revolutionary documents during the Revolutionary period. Sequence alignment is a family of algorithms used in a surprising range of disciplines from genetics to text analysis to identify similar segments of arbitrary length. For this work, we aligned the FRC 1787-99 sample against ARTFL’s Frantext pre-1788 collection. The Frantext sample contains 1,263 documents and is particularly strong in 18th-century holdings. We loaded the results of the alignment run in a dedicated database which can be queried in a variety of ways, such as source and/or target metadata as well as by words in matching passages.

The public database (June 22, 2018 build) found 8,937 aligned passages, of which around 1,000 were identified algorithmically as banalities. Filtering out alignments shorter than 10 words results in just under 7,000 passages. It is important to note that these numbers are very relative, since they can vary significantly depending on the approach we use to identify and merge, where appropriate, longer passages. The general frequencies are not particularly surprising. The following is a table of the number of borrowed passages in the FRC by author.
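The filtering and per-author counting described above amount to something like the following sketch. The record structure and field names are hypothetical, the banality flag is assumed to have been set by an earlier step, and the sample records are invented.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Alignment:
    source_author: str         # author of the pre-Revolutionary source text
    passage: str               # the shared passage as found in the FRC document
    is_banality: bool = False  # assumed to be flagged by an earlier step


def filter_alignments(alignments: List[Alignment], min_words: int = 10) -> List[Alignment]:
    """Drop banalities and passages shorter than `min_words` words."""
    return [
        a for a in alignments
        if not a.is_banality and len(a.passage.split()) >= min_words
    ]


def count_by_author(alignments: List[Alignment]) -> Dict[str, int]:
    """Number of retained passages per source author."""
    return dict(Counter(a.source_author for a in alignments))


# Invented sample records
sample = [
    Alignment("Rousseau", "l'homme est né libre et partout il est dans les fers"),
    Alignment("Rousseau", "vive la nation"),
    Alignment("Voltaire", "il faut cultiver notre jardin et rien de plus que cela", True),
]
kept = filter_alignments(sample)
print(count_by_author(kept))
```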

Montesquieu – 1,315

Rousseau – 1,133

Voltaire – 979

Mably – 303

Aulony – 263

Racine – 168

Helvétius – 167

D’Holbach* – 146

Saint-Simon – 135

Bossuet – 110

La Fontaine – 94

Diderot – 85

Corneille – 72

Mirabeau – 71

Boileau – 69

Bernardin – 67

Montaigne – 65

*D’Holbach appears as two entries due to slight metadata differences.

The yearly distribution of borrowings from the top three Enlightenment authors again follows a reasonable pattern.

The annual distribution in the FRC of the 536 passages derived from Rousseau’s Contrat Social seems reasonable and is consistent with what we would expect from other evidence.

While the global numbers are interesting, if not very surprising, there are a number of specific texts and authors which would warrant further investigation. There are numerous chapbooks, such as the Calendrier moral, 1794, which are interesting because of their selection of inspiring passages from various authors. Jean-Jacques Barthélemy’s L’Accord de la religion et de la liberté (1791) features some 25 long extracts from d’Holbach’s Système social.

The alignment database is available to the public and has a variety of useful features. This link will run a search for all of the aligned passages in the FRC from Rousseau’s Contrat Social longer than 10 words. The report is laid out chronologically (in this case by FRC year). Each instance shows the matching passages with available metadata, links to the context of each passage, and a button to highlight the differences in each matching pair. The facets on the right will allow you to get frequencies by author, title, year and so on. Clicking on those will return the corresponding text pairs.

We anticipate further experimental work on the FRC, most notably using the excellent subject information as a way to assess the accuracy of Topic Modelling and to consider supervised learning algorithms to further classify the collection by subject.

It is our pleasure to acknowledge that the Newberry Library has released this extraordinary resource under the Open Data Commons Attribution License, ODC-BY 1.0. We believe that this splendid collection and the Newberry’s release of all of the data will facilitate a generation of ground-breaking work in Revolutionary studies. If you find the collection useful, please do contact the Newberry Library to congratulate them on this wonderful initiative and to let them know how their efforts contribute to your research.

We would love to hear from you. Please send comments, suggestions and problem reports to artfl@artfl.uchicago.edu.

– Clovis Gladstone and Mark Olsen

Digitizing Enlightenment III

The Voltaire Foundation, in collaboration with the Cultures of Knowledge project, the Maison Française d’Oxford, the Oxford Centre for European History and the Centre for Early Modern Studies, was pleased to host the third instalment of the Digitizing Enlightenment conference series on the 19th and 20th of July. This was the first academic event organised under the auspices of the Voltaire Lab, and was made possible by further support from the John Fell Fund.

Digitizing Enlightenment (DE) is a conference series that is establishing its domain as a major area of innovation in the Digital Humanities. The first convening of DE was in Sydney in 2016, hosted by Simon Burrows at Western Sydney University. This first meeting launched a set of discussions around a common set of problems and identified topics for collaboration in pursuit of interoperability among six distinguished, and in some cases, long-standing DH projects in the field of Enlightenment Studies:

  1. The ARTFL Project (Chicago);
  2. Mapping the Republic of Letters (Stanford);
  3. The Comédie Française Registres Project (MIT/Paris-Sorbonne/Nanterre);
  4. The French Book Trade in Enlightenment Europe (Western Sydney);
  5. Electronic Enlightenment (Oxford); and
  6. MEDIATE (Radboud).

The second gathering in Nijmegen in June of 2017, hosted by Alicia Montoya at Radboud University, continued these discussions and opened up more lines of communication and possible collaborative research across Europe and expanded our working notion of ‘Enlightenment’ as an historical period. These meetings thus established an international network of major digital humanities projects working on 17th- and 18th-century European intellectual and literary history. As a group, these projects have sought to identify and work collaboratively on shared research problems, solutions, and resources generated by their respective research programs in order to facilitate more comprehensive approaches to some of the major problems in the field today.

Greg Brown and conference attendees, Maison française d’Oxford.

Digitizing Enlightenment III was, by design, more focused than the prior meetings: it was aimed more narrowly at the hot topic of historical prosopography and network analysis, an area in which we felt the DE network can potentially provide leadership, and which could provide technical solutions that might allow for the integration of a whole range of ambitious projects in this field. The first two conferences were modest in size and quite international: 15-20 papers over two days, with 30-40 people in attendance. With our narrower focus, the third meeting was somewhat smaller but even more international, with participants from Australia, Austria, France, Germany, the US, and the UK. Accordingly, its format was more concentrated, in the form of six thematic roundtables, each dedicated to proposing and discussing functional solutions to real-world problems already encountered in network analysis and prosopography of this period.

These roundtables were organized around a set of basic questions that allowed participants to engage with the overall theme of the conference, without necessarily being experts in the domain. Participants spoke briefly on each proposed question, which allowed for ample discussion and question time afterwards. These questions included:

  • Why prosopography? Why networks?
  • What are historical or intellectual networks?
  • What is social network analysis?
  • How to re-construct a social network?
  • Who or what is excluded from networks?
  • What lies beyond networks, beyond prosopography?
  • How to link, sustain, and maintain networks?

A final roundtable was dedicated to discussion of next and future steps in this collaborative work. It was decided there that we should aim to hold another event either during or around next year’s ISECS International Congress on the Enlightenment in Edinburgh.

Greg Brown (standing) and Howard Hotson.

Participants were also treated to a reception and dinner at Balliol College, generously sponsored by the Bodleian Libraries.

Between roundtables, we invited participants to present some of the current projects that are underway in the broad field of digital Enlightenment studies. These short presentations included already established projects, such as Early Modern Letters Online, the Quill Project, and Six Degrees of Francis Bacon, as well as new projects, such as the sequel to Simon Burrows’s FBTEE project, Mapping Print, Charting Enlightenment, and projects not yet fully developed on an early modern digital gazetteer, a new prosopographical model for natural law academics, and a project underway at Stanford on 18th-century salons as ‘networks’.

Our hope is that the Digitizing Enlightenment brand will continue on into the future, both in the form of future meetings – at ISECS in 2019 and perhaps Chicago in 2020 – and in a volume currently being edited for the Oxford University Studies in the Enlightenment series, which draws its content from the first two meetings. Should you have any questions about these projects, or our vision for future Digitizing Enlightenment events, please feel free to contact us at: de3@digitizingenlightenment.com

– Gregory Brown and Glenn Roe

Voltaire Lab: new digital research tools and resources

As part of our efforts to establish the Voltaire Lab as a virtual research centre, we are pleased to announce a major update of the TOUT Voltaire database and search interface, expanding links between the ARTFL Encyclopédie Project and several new research databases made available for the first time. Working in close collaboration with the ARTFL Project at the University of Chicago – one of the oldest and best-known North American centres for digital humanities research – we have rebuilt the TOUT Voltaire database under PhiloLogic4, ARTFL’s next-generation search and corpus analysis engine.

New Search interface for TOUT Voltaire

PhiloLogic4 is a powerful research tool, allowing users to browse Voltaire’s works dynamically by date or title, along with further faceted browsing using the ‘title’, ‘year’ and ‘genre’, combined with word and phrase searching. Word searches are greatly improved for flexibility and ease of display and now include four primary result reports:

  • Concordance, or search terms in their context
  • KWIC, or line-by-line occurrences of the search term
  • Collocation, or terms that co-occur most with the search term
  • Time Series, which displays search term frequency over time

The new search interface will allow users to formulate complex queries with relatively little effort, following lines of enquiry in a dynamic fashion that moves from ‘distant reading’ scales of exploration to more fine-grained close textual analysis.
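For readers unfamiliar with these report types, the following minimal Python sketch shows what a KWIC report and a collocation count compute on a toy text. It does not use PhiloLogic4 or reproduce its implementation; the function names, window sizes and sample sentence are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of letters (accents are kept)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def kwic(tokens, term, width=2):
    """Return one line per occurrence of term, with `width` words of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def collocates(tokens, term, window=2):
    """Count the words that occur within `window` words of each occurrence of term."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Toy text, invented for the example.
text = "Il faut cultiver notre jardin. Le jardin de Candide est petit, mais le jardin est à lui."
tokens = tokenize(text)

for line in kwic(tokens, "jardin"):
    print(line)
print(collocates(tokens, "jardin").most_common(3))
```

In this toy example the most frequent collocate is a function word; collocation reports on real corpora usually filter such words out so that content words come to the fore.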

TOUT Voltaire search results

Also in collaboration with ARTFL, we have just released the Autumn Edition 2017 of the ARTFL Encyclopédie, a flagship digital humanities project that for almost twenty years has made available online the full text of Diderot and d’Alembert’s great philosophical dictionary. This new release offers many new features, functionalities and improvements. The powerful new faceted search and browse capabilities offered by PhiloLogic4 allow users better to leverage the organisational structure of the Encyclopédie – classes of knowledge, authors, headwords, volumes, and the like. It also gives them the possibility of exploring the interesting alternatives offered by algorithmically or machine-generated classes. The collocation search generates word-clouds or word lists that are clickable to obtain concordances for any of the words immediately. Further improvements include new author attributions, various text corrections, and better cross-referencing functionality.

New ARTFL Encyclopédie interface

This release also contains a beautiful new set of high-resolution plate images. Clickable thumbnail versions lead to larger images that can be viewed in much greater detail than was previously possible.

New high resolution plate images, ‘Imprimerie en taille douce’

Close up of plate image

Thanks to the Voltaire Foundation, full biographies of the encyclopédistes are directly accessible from within the ARTFL Encyclopédie simply by clicking on the name of the author of any given article. This information is drawn directly from Frank and Serena Kafker’s The Encyclopedists as Individuals: A Biographical Dictionary of the Authors of the Encyclopédie (SVEC 257, 1988) – still the standard reference for biographical information on the Encyclopédie’s 139 contributors. Our hope is that this first experiment will demonstrate the value of linking digital resources openly in ways that can add value to existing projects and, at the same time, increase the visibility of the excellent works contained in the Oxford University Studies in the Enlightenment back catalogue.

Finally, we have begun the work of establishing new research collections that will form the basis of the Voltaire Lab’s textual corpus. For example, working with files provided by Electronic Enlightenment, we have combined all of Voltaire’s correspondence with TOUT Voltaire. This new resource, which we are for the moment calling ‘TV2’, contains over 22,000 individual documents and more than 13 million words, making it one of the largest single-author databases available for research. Due to copyright restrictions in the correspondence files we cannot make the full dataset publicly available; however, we are keen to allow researchers access to this important resource on a case-by-case basis. Students and scholars who wish to access the PhiloLogic4 build of TV2 should contact me here.

Glenn Roe

Voltaire Foundation appoints Digital Research Fellow

I am delighted to announce my appointment as Digital Research Fellow at the Voltaire Foundation for the academic year 2017-2018. This is the first Digital Humanities appointment in French at Oxford, and is made possible by the generosity of M. Julien Sevaux and the John Fell Fund. As Digital Research Fellow, I will oversee the creation of a pilot Digital Voltaire project, establishing a dataset that for the first time contains all of Voltaire’s works, including his correspondence, as well as undertake a series of computational experiments around the theme of ‘Visualising Voltaire’.

Voltaire, by Maurice Quentin de La Tour, 1735.

As the monumental print edition of the Complete Works of Voltaire nears completion, the Voltaire Foundation is currently preparing the ground for Digital Voltaire, an interactive and innovative digital edition of Voltaire’s Œuvres complètes. The pilot project we are embarking upon will thus bring together two key existing datasets: TOUT Voltaire, developed in collaboration with the ARTFL Project at the University of Chicago; and Voltaire’s letters, drawn from Electronic Enlightenment. The combined dataset will include more than 20,000 individual documents and over 11 million words, making this one of the largest, if not the largest, single-author databases available for digital humanities research. This resource, together with a focused research project to scope and understand its potential uses and applications, will enable the Voltaire Foundation to begin to create a conceptual and infrastructural framework for a broader, transformational Digital Voltaire, for which fundraising efforts have already begun.

The Visualising Voltaire project will become part of the soon-to-be-created ‘Voltaire Lab’ – a virtual space for new research experimentation and dissemination centred on Voltaire’s textual output and its relationship to the broader field of eighteenth-century studies. By interrogating the ‘big data’ of Voltaire’s texts at both a macro- and microscopic level, we hope to shed new light on Voltaire’s use of intertextuality, his most commonly used themes and literary motifs, his intellectual networks, and his development as a thinker. This research project will further benefit from close existing ties with the ARTFL Project and the newly-established Textual Optics Lab at the University of Chicago, and with the Labex OBVIL (‘Observatoire de la vie littéraire’) based at the Sorbonne; centres for digital humanities research and development in French studies where much of this type of analysis has been pioneered.

Visualising Voltaire will include a number of literary experiments to test the scholarly and critical value of a combined digital archive of Voltaire’s texts. Following on from the work of Franco Moretti and the Stanford Literary Lab, the project will investigate how we can apply distant reading approaches to this large corpus in order to discover new connections and patterns at scale, and, at the same time, how these new approaches can interact and intervene with our traditional close reading modes of analysis. To this end, we have identified two areas of research that we will pursue in 2017-2018, and that we hope will lead to further projects in the future.

Sequence alignment in the intertextual edition of Raynal’s Histoire des deux Indes, Centre for Digital Humanities Research, Australian National University.

In the first instance, we will focus on Voltaire’s ‘intertextuality’ and how computational techniques such as sequence alignment – borrowed from the field of bio-informatics – can help us better understand the rich complexity of Voltaire’s writing practices. Indeed, one of the major research questions that has arisen from the preparation of the Complete Works of Voltaire concerns Voltaire’s unacknowledged use and reuse of other texts. This takes two forms: the widespread reuse (borrowing/theft/imitation) of works by other writers, and the equally widespread reuse of his own work. This is a huge subject that has never been satisfactorily studied until now.

In a second instance, the completion of the Complete Works of Voltaire on paper has also created the opportunity to provide an index to the whole of his writings, notably using automatic indexing and classification techniques developed in the fields of artificial intelligence and machine learning. In addition to our ‘traditional’ indexes of the paper editions, which can be digitised and leveraged for computational analysis, we will also aim to generate ‘thematic maps’ of Voltaire’s works and correspondence using both supervised and unsupervised machine learning algorithms such as vector space analysis and topic modelling. These sorts of approaches will, we hope, open up Voltaire’s writings in wholly new and exciting ways, creating opportunities for high-profile public engagement activities such as hackathons, and generating new areas of investigation for potential doctoral research students.
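As a very small illustration of what ‘vector space analysis’ means in practice, the sketch below represents each text as a vector of word counts and compares texts by cosine similarity. It is only the simplest member of this family of techniques, not the project’s method, and not topic modelling; the sample sentences and labels are invented for the example.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Represent a text as a bag of words: a mapping from word to count."""
    return Counter(re.findall(r"[^\W\d_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents, invented for the example.
docs = {
    "tolerance_a": "la tolérance est la vertu des philosophes",
    "tolerance_b": "les philosophes prêchent la tolérance",
    "garden": "il faut cultiver notre jardin",
}
vectors = {name: vectorize(text) for name, text in docs.items()}

sim_related = cosine(vectors["tolerance_a"], vectors["tolerance_b"])
sim_unrelated = cosine(vectors["tolerance_a"], vectors["garden"])
print(round(sim_related, 3), round(sim_unrelated, 3))
```

Texts that share vocabulary score higher than texts that do not, and it is this kind of measure that lets documents be grouped into thematic clusters without a hand-made index.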

From Jean-Benjamin de Laborde’s Choix de Chansons, 1774 – subject of the ARC Discovery grant ‘Performing Transdisciplinarity’.

And finally, beyond these specific research projects, my role as Digital Research Fellow will entail making and maintaining connections with digital humanities teams both locally and internationally, building on past and current relationships to generate new research initiatives moving forward. We are interested, for example, in establishing a better understanding of the importance of Voltaire’s Enlightenment network and its participation in the larger eighteenth-century Republic of Letters, questions that can be addressed in collaboration with the Center for Spatial and Network Analysis at Stanford, and the Cultures of Knowledge project based in Oxford. The Voltaire Lab can thus become a venue for engaging with other complementary Oxford digital projects, such as the Newton Project, which will allow for broader access as well as further fundamental research. Newton is often seen as the key thinker who sets the agenda for Enlightenment scientific thinking – through his emphasis on empiricism and the experimental method – while Voltaire, the dominant intellectual figure of the Enlightenment, helps to popularise Newton’s scientific method across Europe. Voltaire’s role as a key critic and disseminator of ideas and texts is also an area of research to which digital approaches can bring much to bear, in particular by linking his correspondence to projects such as Western Sydney University’s French Book Trade in Enlightenment Europe and Mapping Print, Charting Enlightenment.

We are equally keen to investigate the deeply interdisciplinary nature of Voltaire’s work beyond the purely literary or even textual, and, more generally, of his role in the often-overlooked interplay of music, images, and text in eighteenth-century print culture. This is in fact the subject of our recently awarded Australian Research Council Discovery Grant, ‘Performing Transdisciplinarity’, which brings together a team of interdisciplinary researchers from the Australian National University, the Universities of Melbourne and Sydney, and Oxford.

The above are just a few of the countless avenues of research opened up by digital approaches to Voltaire’s work and legacy, and to which many more will be added as the larger Digital Voltaire project takes shape over the next few years. As the newly appointed Digital Research Fellow at the VF, I very much look forward to keeping you all informed on the results of these experiments and of the project’s evolution in due course.

– Glenn Roe

Tout Voltaire

The Voltaire Foundation, in collaboration with the ARTFL Project, is pleased to announce the public release of the TOUT VOLTAIRE online database. This database brings you in fully searchable form all of Voltaire’s works apart from his correspondence (which can be searched separately, in Electronic Enlightenment).

Currently publishing the Complete works of Voltaire in print, the Voltaire Foundation plans to unveil an online version of this definitive critical edition sometime after 2018. In the meantime, this plain text version of Voltaire’s writings (without critical apparatus or notes) is the most reliable version available anywhere on the web.

The various editions used to establish this database are clearly marked: from the Voltaire Foundation’s own Complete works of Voltaire to nineteenth-century editions by Beuchot and Moland, among others. When possible we have included Voltaire’s notes, as well as some textual variants depending on the edition. Pagination, however, is often not representative of the print editions, so if you wish to cite Voltaire for scholarly purposes, you should always consult the list of the best critical editions currently available.

The TOUT VOLTAIRE database is built using ARTFL’s full-text search and retrieval engine PhiloLogic, one of the oldest and most successful text analysis systems in the digital humanities. With a wide variety of search and reporting functions, users can look for words, groups of words, or phrases over Voltaire’s entire corpus, or in individual works (and even parts of works). Results can be displayed in context, as frequency reports (by title, by decade, etc.), or as a collocation table and word cloud.
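A frequency report by decade amounts to counting a search term in each document and grouping the counts by the decade of the document’s date. The Python sketch below shows the idea on invented records; it does not call PhiloLogic, and the years, texts and function name are hypothetical placeholders rather than data from the database.

```python
import re
from collections import Counter


def frequency_by_decade(documents, term):
    """Count occurrences of term per decade.

    documents is an iterable of (year, text) pairs. Decades with documents
    but no occurrence of the term are reported with a count of 0.
    """
    counts = Counter()
    for year, text in documents:
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        decade = year // 10 * 10
        counts[decade] += tokens.count(term)
    return dict(sorted(counts.items()))


# Hypothetical (year, text) records, invented for the example.
documents = [
    (1734, "la raison et la liberté"),
    (1738, "la liberté de penser"),
    (1759, "cultiver la raison"),
]
report = frequency_by_decade(documents, "liberté")
print(report)
```

Raw counts like these are sensitive to how much text survives from each decade, so a fuller report would normally also give the frequency relative to the number of words per period.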

Example searches could include:

For more search tips, please visit the PhiloLogic user manual.

This research tool is made available free of charge by the Voltaire Foundation (University of Oxford) and the ARTFL Project (University of Chicago). If you wish to make a contribution to our work, please contact the Voltaire Foundation.

Glenn Roe