Editing and digitising marginalia

Voltaire’s comments on Frederick II’s L’Art de la guerre, Clement Draper’s depictions of chemical processes, Herman Melville’s pencil scores, or Samuel Beckett’s reading traces… these are all what we define as marginalia: the reader’s markings in the margins of a book. These markings are difficult to pin down in terms more specific than scribbles, references, and thoughts captured on a page. There is no apparent common rule that groups them together and specifies how they should be understood as a whole, even though they are often studied as an ensemble or a genre. Furthermore, the line – if there is a line – that defines the margins themselves is not always evident, and that is why scholars are constantly questioning what marginalia are, while trying to differentiate between the primary text and its annotations. As Laura Estill acknowledges in her article ‘Encoding the edge: manuscript marginalia and the TEI’, ‘perhaps there are easier distinctions to be made when marginalia is handwritten in printed books – although even then, in the case of authorial revisions, stop-press corrections, or (say) Whitman’s notes in another book, there is no easy answer as to what is “marginal”’.

A discussion of what exactly this marginal space is and how it interacts with the text is crucial when considering the central query of the Editing and Digitising Marginalia workshop: how can the marginalia of source material be encoded as fully, accurately, and helpfully as possible? By trying to define the purpose and character of the marginalia of Voltaire (Nicholas Cronk, Gillian Pink, and Dan Barker), Draper (Zoe Screti), Melville (Christopher Ohge), and Beckett (Dirk Van Hulle), the speakers delved into the challenges of digitally editing marginalia, which requires a completely different framework of analysis from that of pre-digital editions or even digital facsimile editions. Following on from the OCTET colloquium on Writers’ Libraries, this workshop explored the importance of studying authors through their reading practices. It focused on the editorial choices behind digitally encoding marginalia, with the added layer of complexity that derives both from the difficulties and the possibilities of the digital medium.

When designing a data model that could represent marginalia as a key component of Voltaire’s complete works, for example, the verbal elements were comparatively easier to encode than the non-verbal marks. Voltaire used different materials to underline, draw, and mark the pages he was reading, or he folded, licked, and stuck them together. How can these practices possibly be translated into the digital sphere? For this digital project, the source material came from the transcribed print volumes of the Corpus des notes marginales de Voltaire, which were themselves one step removed from the original source material, since they had already undergone an editorial process that transformed the original squiggles into typeset signs.

Dan Barker, ‘The aim of digitising OCV’, picture taken by author.

Dan Barker, the Digital Consultant at the Voltaire Foundation, explained in his presentation ‘The aim of digitising OCV’ how he had created a system of mark types to record these marks in order to reproduce source material fully, accurately, and helpfully. He classified marks according to nodes (the points where the lines meet or cross) or edges (uninterrupted lines) to convey their nature, presence, and relationship to the text. Even if the method does not account for the colour, medium, intensity, or even authorship of marginal marks, readers will be able to search for specific classifications of marks and see whether Voltaire used them more than once, and where. It is a process that operates within the principles proposed by Gillian Pink of what a new born-digital edition of a manuscript should be: legible, containing both visual and non-verbal elements, and searchable, with a modernised transcription to avoid the potential pitfalls of searching for idiosyncratic spellings.
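To make the idea of searchable mark types concrete, here is a minimal sketch in Python. It is not the Voltaire Foundation's actual data model: the `Mark` class, the `n…-e…` label format, and the page references are all hypothetical, chosen only to show how counting nodes and edges could yield a classification that readers can query.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """One non-verbal mark (hypothetical model, for illustration only)."""
    page: str   # hypothetical page reference
    nodes: int  # points where lines meet or cross
    edges: int  # uninterrupted lines

    @property
    def mark_type(self) -> str:
        # Hypothetical label built from the two counts, e.g. "n1-e2".
        return f"n{self.nodes}-e{self.edges}"


def find_marks(marks, nodes, edges):
    """Return every mark with the given node and edge counts."""
    return [m for m in marks if m.nodes == nodes and m.edges == edges]


def type_frequencies(marks):
    """Count how often each mark type occurs across the corpus."""
    return Counter(m.mark_type for m in marks)


if __name__ == "__main__":
    corpus = [Mark("p.12", 0, 1), Mark("p.40", 1, 2), Mark("p.77", 0, 1)]
    print([m.page for m in find_marks(corpus, nodes=0, edges=1)])
    print(type_frequencies(corpus))
```

As in the system described above, this sketch deliberately records nothing about colour, medium, or intensity; those would need additional fields.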

The issue of searchability was further discussed by Zoe Screti, a postdoctoral researcher at the Voltaire Foundation, in her paper ‘Alchemical marginalia written in prison and cataloguing marginalia’. The quantity and diversity of Clement Draper’s marginalia, in the shape of memory aids, summaries, symbols, diagrams, or eyewitness accounts, are not reflected in the catalogue entries of his archival materials. That discrepancy points towards a mismatch between the way catalogues were built and the questions that scholars are asking now. This is why Screti is updating the system with usability and consistency in mind, both of which aim to make sources of marginalia accessible and discoverable.

She has access to a subset of Voltaire’s manuscripts and is cataloguing them from scratch, which provides her with a decision-making margin that others might not be able to work with. They are also small in size, allowing for a detailed granularity that would be difficult to obtain if working with Draper’s notebooks, for example. But the challenges of ensuring that catalogues keep up with the pace of research on marginalia remain, in big and small collections alike. If we want to be able to locate specific categories of marginalia, as is the case with Voltaire’s non-verbal markings, and include nuances in our current search and text analysis tools, they need to appear in the catalogue entries, and that means going beyond filters and single codes.

Voltaire’s non-verbal annotations to the Marquis de Vauvenargues’s Introduction à la connaissance de l’esprit humain and their appearance in the Voltaire Foundation’s edition of the marginalia.

Finally, both Melville’s and Beckett’s marginalia are representative of common methodological issues in terms of how to create a uniform TEI data model. As Christopher Ohge explained in his talk entitled ‘Melville’s Marginalia Online, with some general provocations’, there is no solution that covers all cases of marginalia encoding, and that is why current projects have very different data models. He provided an overview of those differences, showing how in Keats’s Paradise Lost, a Digital Edition or Whitman’s marginalia to Thoreau’s A week on the Concord and Merrimack Rivers, marginalia are wedged into the hierarchy of the existing text to make it work within different structures, while Archaeology of Reading has a bespoke XML tagging structure with a marginalia attribute.

But changing content IDs and crossing over the hierarchy of line elements, or having a general term that does not include subtleties, is not the methodological solution chosen for Melville’s Marginalia Online. This research tool uses software developed by the Whitman Project to generate the page coordinates of the already uploaded facsimile images, so that a page can be found directly with a word search. Melville’s marginalia are encoded in a <div> tag with several attribute values, so as to include all detail and information. The question posed by Ohge then was as follows: how much context is needed to understand marginalia, and how much granularity?
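As an illustration of what a <div> carrying several attribute values might look like in practice, the following Python sketch parses a small invented XML fragment. The attribute names (`type`, `subtype`, `place`, `medium`) and the sample content are assumptions made for this example; they are not the schema used by Melville’s Marginalia Online.

```python
import xml.etree.ElementTree as ET

# Invented sample: attribute names and values are hypothetical.
SAMPLE = """<page n="12">
  <div type="marginalia" subtype="score" place="left-margin" medium="pencil">
    <ref target="#line-4"/>
  </div>
  <div type="marginalia" subtype="annotation" place="bottom-margin" medium="pencil">Compare ch. 3</div>
</page>"""


def marginalia_records(xml_text):
    """Collect each marginalia <div> as a flat dict of its attributes."""
    root = ET.fromstring(xml_text)
    records = []
    for div in root.iter("div"):
        if div.get("type") == "marginalia":
            record = dict(div.attrib)
            # Verbal marginalia carry text; non-verbal marks leave it empty.
            record["text"] = (div.text or "").strip()
            records.append(record)
    return records


if __name__ == "__main__":
    for record in marginalia_records(SAMPLE):
        print(record)
```

Keeping the detail in attributes means a search can filter on any one of them (for instance, every pencil score in a left margin) without restructuring the surrounding text.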

In an intervention entitled ‘Editing Beckett’s Marginalia’, Dirk Van Hulle answered by stating that it depends on the author, the type of marginalia they wrote, and the resources available to the digital project that provides such context. One of the key things that digital marginalia allow, as is the case with Beckett, is an insight not only into the reader himself, but also into the underlying structure of all his drafts and notebooks: a network of markings that, in turn, puts into context how his reading engendered his writing.

In order to make that network visible and searchable, one of the solutions going forward is to use IIIF (International Image Interoperability Framework) as a means of engaging with marginalia. Making resources IIIF compliant ensures they are interoperable with other software, as well as easy to maintain as an online resource with which scholars can interact. It is also culturally inclusive, as it operates on a ‘blank canvas’ principle, meaning that non-codex objects can be presented in full.

A piece of marginalia in Voltaire’s copy of the Marquis de Vauvenargues’s Introduction à la connaissance de l’esprit humain demonstrating a stark difference in line weight.

IIIF image viewers could potentially work with improving transcription software, such as Transkribus, to allow for comprehensive resources that can display an image of the page with all its marginalia, paratext, and physical attributes as well as an interactive description and viewable transcription. The ability to describe elements of a text accurately and efficiently via pinpointing areas that have their own locus of metadata, as IIIF is capable of, means that more effort can be devoted to accurate scholarship, which is precisely what Gillian Pink stated in her paper ‘Editing Voltaire’s commentary on Frederick II’s L’Art de la guerre – third time lucky?’ She proposed, for example, to use different colours for the different hands that worked on the manuscript (Frederick II, his secretary, and Voltaire) as a way to take advantage of annotation possibilities with IIIF. However, the question remains: how can we decide which textual blocks should be transcribed as a unit in order to properly represent Voltaire’s marginalia?
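The idea of pinpointing a region of the page and attaching metadata to it can be sketched with a few lines of Python. The dictionary below follows the general shape of a W3C Web Annotation, the model that IIIF Presentation API 3.0 uses for annotations, with the region expressed as an `xywh` fragment on the canvas. The canvas URL, annotation id, coordinates, transcription, and the colour table for the three hands are all placeholders invented for the example.

```python
import json

# Hypothetical display colours, one per hand; a viewer could use these.
HAND_COLOURS = {
    "Frederick II": "#1f77b4",
    "secretary": "#2ca02c",
    "Voltaire": "#d62728",
}


def region_annotation(anno_id, canvas_id, x, y, w, h, text, hand):
    """Build a Web Annotation targeting a rectangular region of a canvas."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "motivation": "commenting",
        "body": [
            # The transcription of the marginal note.
            {"type": "TextualBody", "value": text, "format": "text/plain"},
            # A tag naming the hand, so that a viewer can filter or colour by it.
            {"type": "TextualBody", "value": hand, "purpose": "tagging"},
        ],
        # Media-fragment selector: x, y, width, height in canvas units.
        "target": f"{canvas_id}#xywh={x},{y},{w},{h}",
    }


if __name__ == "__main__":
    anno = region_annotation(
        "https://example.org/anno/1",
        "https://example.org/canvas/1",
        10, 20, 300, 40,
        "placeholder transcription", "Voltaire",
    )
    print(json.dumps(anno, indent=2))
    print(HAND_COLOURS[anno["body"][1]["value"]])
```

How large each targeted region should be is exactly the open editorial question about units of transcription; the code only records whichever decision the editor makes.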

The various contributions to the Editing and Digitising Marginalia workshop helped us sketch some answers to this question. Nonetheless, many threads were left to pull, ensuring that, hopefully, there will be another workshop to show how all the projects have built on existing methods while defying their own limits and scope, so that we keep rediscovering authors through the marginal notes that they left.

– Joana Roque

Related Posts

My summer with Voltaire: digitising the Lambert-David collection

In May 2021, I was approached to digitise the Lambert-David collection, a series of Voltaire manuscripts belonging to Professor Peter Southam. After an initial test phase carried out by a professional photographer, Mr François Lafrance, the work took place in the premises of the Faculté des lettres et sciences humaines on the main campus of the Université de Sherbrooke. As the university has no room or professional equipment dedicated to archival digitisation, the other team members and I gathered the equipment needed to undertake the project. Throughout the process, I took care to follow the recommendations and practices of three heritage institutions (Bibliothèque et Archives nationales du Québec, Bibliothèque nationale de France, and Musée canadien de l’histoire) for digitising documents, as summarised in the Recueil de règles de numérisation.* Here I give an overview of the digitisation process and of my experience of the work.

Equipment and workspace

Two types of tool are available on the market for digitising documents: scanners and digital cameras. The choice between the two technologies depends essentially on the condition of the manuscripts in the collection. A scanner has several advantages: very high-resolution capture, uniform lighting managed by the device, and no optical distortion. However, it is less versatile than a camera (the scanning surface is limited) and is often much more expensive. For this project, we opted for a digital camera. Here is a non-exhaustive list of the equipment required:

  • A digital camera, DSLR type, of at least 12 megapixels (Recueil, p.10)
  • A lens: preferably a zoom lens, that is, one with a range of focal lengths
  • A data storage device
  • A computer with image-processing software
  • A horizontal tripod
  • A continuous lighting system
  • A neutral-coloured background: white, grey, or black
  • A colour chart: ColorChecker or Q-13 chart (Recueil, p.12)

Our aim was to set up a temporary workspace that was both efficient and economical. Several items of equipment were given or lent to us. The digital camera (Canon Rebel T7) and its lens (EF-S 18-55 mm) were borrowed from the Comptoir de prêt of the Service de soutien à la formation at the Université de Sherbrooke. The owner of the collection, Prof. Southam, already had a horizontal tripod, and Mr Lafrance kindly printed a grey background for us. For capturing and processing the images, I used my personal computer, equipped with an old version of Photoshop and Camera Raw. I also installed Canon’s free software (EOS Utility) for remote image capture, as well as Adobe’s file conversion software (Adobe Digital Negative Converter). Using image-acquisition software makes it possible to adjust the settings and trigger the camera remotely, which avoids touching the camera during digitisation.

Lighting is crucial to taking good photographs. Using a professional flash unit is strongly recommended (Recueil, p.12), but it is a very expensive piece of equipment. We therefore opted for continuous lighting. I first borrowed four LED lamps from the Comptoir de prêt, but as this equipment is in high demand, I finished the project with a personal lighting kit made up of two softboxes with diffusers.

The piece of equipment that proved hardest to obtain was the colour chart. Although the ColorChecker is more reliable and more readily available, we opted for the Tiffen Q-13 chart by Kodak. To simplify processing and ensure that each image is independent, the colour chart had to appear beside every page photographed. The format of the ColorChecker (12×9 cm) was therefore impractical compared with that of the Q-13 chart (19×6 cm). Moreover, for the sake of consistency, it was the same chart that the photographer François Lafrance had used in the first phase of the project.

Finally, a suitable working environment had to be chosen. It is preferable to choose a fairly dark room (without windows) where the ambient light is subdued and constant. The space must also be clean and free of dust (Recueil, p.12). The equipment is then set up, making sure that the scanning surface is well lit.

The process

Before beginning image acquisition, the camera settings must be adjusted. At least four factors need to be considered: exposure, white balance, sharpness (focus), and recording format. Exposure is influenced by three parameters: sensor sensitivity (ISO), aperture, and shutter speed. For this project, at the suggestion of the photographer François Lafrance, I used an aperture of f/8, a shutter speed of 1/100, and an ISO sensitivity of 100. These settings are given here only as a guide, and they may need to be adjusted according to the amount of light available. What matters is a sharp, well-exposed image. I also adjusted the white balance using the EOS Utility tool and the colour chart (this step was carried out at the start of each digitisation session). Finally, I set the recording format to RAW (CR2).
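To show how the three exposure parameters trade off against one another, here is a short worked example in Python. It uses the standard photographic exposure value normalised to ISO 100, EV = log2(N² / t) − log2(ISO / 100), where N is the f-number and t the shutter time in seconds. The formula is a general photographic convention, not something taken from the Recueil, and the alternative settings shown are purely illustrative.

```python
import math


def exposure_value(f_number, shutter_s, iso=100):
    """Exposure value normalised to ISO 100.

    EV = log2(N^2 / t) - log2(ISO / 100)
    Equal EV means the same amount of scene light is needed for a
    correct exposure, whatever mix of settings is used.
    """
    return math.log2(f_number ** 2 / shutter_s) - math.log2(iso / 100)


if __name__ == "__main__":
    # Settings used in the project: f/8, 1/100 s, ISO 100.
    base = exposure_value(8, 1 / 100, 100)
    print(f"f/8, 1/100 s, ISO 100 -> EV {base:.2f}")

    # Illustrative alternative: in dimmer light, doubling the ISO
    # lowers the required EV by exactly one stop.
    dimmer = exposure_value(8, 1 / 100, 200)
    print(f"f/8, 1/100 s, ISO 200 -> EV {dimmer:.2f}")
```

If the lighting changes, any adjustment that restores a correct exposure will do; with a tripod and remote triggering, lengthening the shutter time is usually the option that costs least in image quality.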

The manuscripts can now be photographed. The document must first be positioned on the scanning surface. It is important to lay it flat when its condition allows. In other words, the manuscript must be respected, its preservation taking priority over the quality of the file. The image is then framed using the ‘live view’ of the acquisition software. I displayed the grids and guides to help me centre the document. It is advisable to keep a margin of at least 1 cm all around the manuscript and to make sure that the colour chart is visible in its entirety. All that remains is to focus and press the shutter release. All documents were digitised in full (recto and verso), even when the pages contained no text. The files were saved on an external hard drive and sorted into clearly labelled folders.

The resulting (CR2) files must then be processed. They first had to be converted to a universal RAW format (DNG) using Adobe Digital Negative Converter. I then carried out some processing using Camera Raw (a Photoshop plug-in). For each image, I made the following adjustments: white balance, exposure, and correction of optical distortion. I then opened the document in the Photoshop interface and added the label identifying the manuscript. All images were saved in uncompressed TIFF format. Although I did no colour processing (the equipment at my disposal did not allow me to calibrate the devices), I took care to process and save the images in the colour space recommended in the Recueil de règles de numérisation, namely Adobe RGB 1998 (p.11). Since the colour chart appears in every image, such processing can be carried out later.

Fragile and precious documents: the challenges encountered

During the digitisation process, I came across particular documents that were harder to digitise. Since the integrity and preservation of the archives take priority over the digital copies, the solutions had to avoid any alteration of the original document.

Frederick-Voltaire correspondence, f.1.

The first challenge was digitising a document that had been stuck together with adhesive tape. The tape not only prevented the document from lying flat, but also hid a line of text. Removing the tape would have severely damaged the manuscript. After several discussions within the team, and at the suggestion of the owner, Prof. Southam, I cut away a small part of the tape. Fortunately, this could be done without altering the original document.

Decroze-Voltaire correspondence.

The second challenge was to make documents that had been folded several times lie flat. At the owner’s request, I flattened the documents by pressing them between two heavy books. They were then returned to the owner in that state (without being refolded), again at his request.

Verses by Voltaire.

The third challenge was photographing a very large manuscript (41×78 cm). Prof. Southam’s tripod did not allow me to move the camera far enough back to capture the whole document. I first tried working handheld, but the instability and lack of light made this very difficult. I therefore used another tripod, which I built with my father during my holidays. This gave me more room for manoeuvre and allowed me to move the camera further from the subject. Finally, to make sure the text was clearly visible in the image, I also photographed the manuscript in smaller sections, following the reading order.

Théâtre de Voltaire, vol.1.

The last challenge, and not the least, was digitising a bound document of 250 pages. Handling this document was very delicate, and it was impossible to lay it completely flat without breaking its binding. A master’s student in literature at the Université de Sherbrooke, Frédérika Jean, therefore came to help me. She held the document open while I took the photograph. To simplify the process, the recto of every page was photographed first and the verso second.

Over the summer, I had the good fortune to discover an exceptional collection of Voltaire manuscripts. Digitising it will, in the long term, make it easier to analyse and disseminate. This short article does not claim to be a reference tool for every digitisation undertaking. Rather, it aims to preserve the technical and methodological memory of this experience by describing precisely and transparently the steps that led to the creation of the digital files of the Lambert-David collection. The experience shows that, by combining the efforts and resources of several contributors, it is possible to set up a workspace that is both efficient and economical.

Workspace.

I will end by thanking the whole ‘projet Voltaire’ team, Prof. Peter Southam, Dr Gillian Pink of the Voltaire Foundation, Prof. Louise Bienvenue, Prof. Nicholas Dion, Prof. Anick Lessard and Mr Rock Blanchard, respectively dean and administrative director of the Faculté des lettres et sciences humaines at the Université de Sherbrooke, as well as all the collaborators who took part, closely or from afar, in the project. Special thanks go to François Lafrance, Yvon Blouin (my father), and Frédérika Jean for their invaluable help. This experience not only allowed me to improve my skills in archival digitisation, but also to learn more about a figure and a period of history that I knew too little about. I can now almost say that I spent a summer in the company of one of the most famous philosophers of the Enlightenment: Voltaire!

A custom-made piece of equipment

Prof. Peter Southam’s tripod was adequate for the project as a whole. However, its maximum height did not allow very large documents to be photographed. Digitising a single document did not justify buying a new tripod or a new lens. During my holidays, my father and I set about making another support. Having learned what I needed, my father, a very ingenious and talented man, was able to build a horizontal tripod from recycled objects. These include two hockey sticks, an old tripod found at a garage sale, a former tent pole, a combination square without its rule, and various parts from objects collected over the years. The whole thing was then painted to give it a more uniform look. This new prototype horizontal tripod allows the camera to be moved far enough from the subject to digitise large documents in their entirety. What is more, it is easy to dismantle and adjust. The result is practical, economical, and ecological!

Sonia Blouin

The custom-built support.

* Marie-Chantal Anctil, Michel Legendre, Tristan Müller, Dominique Maillet, Kathleen Brosseau and Louise Renaud, Recueil de règles de numérisation [online] (Bibliothèque et Archives nationales du Québec, Bibliothèque nationale de France, Musée canadien de l’histoire, 2014), on the Bibliothèque et Archives nationales du Québec website, http://collections.banq.qc.ca/bitstream/52327/2426216/1/4671601.pdf, accessed 10 August 2022.

From Cyclopaedia to Encyclopédie: experiments in machine translation and sequence alignment

Figure 1. Title page from the 1745 prospectus of the first Encyclopédie project. This page image is taken from ARTFL’s 18th Volume of the Encyclopédie.

It is well known that the Encyclopédie ou dictionnaire raisonné des sciences, des arts et des métiers began in 1745 as a modest project to translate Ephraim Chambers’ Cyclopaedia. Over the next few years, Diderot and D’Alembert would replace the original editors and the project would be duly transformed from a simple translation into an effort to compile and organise the sum total of the world’s knowledge. Over the course of their editorial work, Diderot, and most notably D’Alembert, were not shy in incorporating these translations of the Cyclopaedia as filler for the Encyclopédie. Indeed, ‘ils ont laissé une bonne partie de ces articles presque inchangés, ou avec des modifications insignifiantes’ (Paolo Quintili, ‘D’Alembert “traduit” Chambers. Les articles de mécanique de la Cyclopædia à l’Encyclopédie’, Recherches sur Diderot et sur l’Encyclopédie 21 (1996), p.75). The philosophes were nonetheless conscious of their debt to their English predecessor Chambers. His name appears some 1154 times in the text of the Encyclopédie and he is referenced as sole or contributing source to 1081 articles, where his name appears in italics at the end of a section or article. Given the scale of the two works under consideration, systematic evaluation of the extent of the philosophes’ use of Chambers has remained, even today, a daunting task. John Lough, in 1980, framed the problem nicely: ‘So far no one has had the patience to make a detailed study of the exact relationship between the text of Diderot’s Encyclopédie and the work of Ephraim Chambers. This would no doubt require several years of arduous toil devoted to comparing the two works article by article’ (‘The Encyclopédie and Chambers’ Cyclopaedia’, SVEC 185 (1980), p.221).

Recent developments in machine translation and sequence alignment now offer new possibilities for the systematic comparison of digital texts across languages. The following post outlines some recent experimental work in leveraging these new techniques in an effort to reduce the ‘arduous toil’ of textual comparison, giving some preliminary examples of the kinds of results that can be achieved, and providing some cursory observations on the advantages and limitations of such systems for automatic text analysis.

Our two comparison datasets are the ARTFL Encyclopédie (v. 1117) and the recently digitised ARTFL edition of the 1741 Chambers’ Cyclopaedia (link). The 1741 edition was selected because it was one of the likely sources for the original translation project and because we were able to work from high-quality page images provided by the University of Chicago Library. (On the possible editions of the Cyclopaedia used by the encyclopédistes, see Irène Passeron, ‘Quelle(s) édition(s) de la Cyclopœdia les encyclopédistes ont-ils utilisée(s)?’, Recherches sur Diderot et sur l’Encyclopédie 40-41 (2006), p.287-92.) In a nutshell, our approach was to generate a machine translation of all of the Cyclopaedia articles into French and then use ARTFL’s Text-PAIR sequence alignment system to identify similar passages between this virtual French Cyclopaedia and the Encyclopédie, with the translation providing links back to the original English edition of the Chambers as well as links to the relevant passages in the Encyclopédie.

For the English to French machine translation of Chambers, we examined two of the most widely used resources in this domain, Google Translate and DeepL. Both systems provide useful Application Programming Interfaces [APIs] as part of their respective subscription services, and both provide translations based on cutting-edge neural network language models. We compared results from various samples and found, in general, that both systems worked reasonably well, given the complications of eighteenth-century vocabularies (in both English and French) and many uncommon and archaic terms (this may be the subject of a future post). While DeepL provided somewhat more satisfying translations from a reader’s perspective, we ultimately opted to use Google Translate for the ease of its API and its ability to parse the TEI encoding of our documents with little difficulty. The latter is of critical importance, since we wanted to keep the overall document structure of our dictionaries to allow for easy navigation between the versions.

Operationally, we segmented the text of the Cyclopaedia into short blocks, split at paragraph breaks, and sent them for automatic translation via the Google API, with a short delay between blocks. This worked relatively well, though the system would occasionally throw timeout or other errors, which required a query resend. You can inspect the translation results here – though this virtual French edition of the Chambers is not really meant for public consumption. Each article has a link at the bottom to the corresponding English version for the sake of comparison. It is important to note that the objective here is NOT to produce a good translation of the text or even one that might serve as the basis for a human edition. Rather, this machine-generated edition exists as a ‘pivot-text’ between the English Chambers and the French Encyclopédie, allowing for an automatic comparison of the two (or three) versions using a highly fault-tolerant sequence aligner designed to pick out commonalities in very noisy document spaces. (See Clovis Gladstone, Russ Horton, and Mark Olsen, ‘TextPAIR (Pairwise Alignment for Intertextual Relations)’, ARTFL Project, University of Chicago, 2008-2021, and, more specifically, Mark Olsen, Russell Horton and Glenn Roe, ‘Something borrowed: sequence alignment and the identification of similar passages in large text collections’, Digital Studies / Le Champ numérique 2.1 (2011).)
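The segment, translate, pause, and resend loop described above can be sketched in a few lines of Python. This is not the code used for the project: the `translate` argument is a stand-in for whatever client function actually calls the translation service, and the delay, the retry limit, and the choice of exceptions to retry on are assumptions made for the example.

```python
import time


def split_blocks(text):
    """Split an article into short blocks at paragraph breaks."""
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def translate_all(blocks, translate, delay_s=0.5, max_retries=3, sleep=time.sleep):
    """Translate blocks one at a time, pausing between requests.

    `translate` is a hypothetical callable (str -> str) wrapping the real
    translation service. Timeouts and connection errors trigger a resend,
    up to `max_retries` attempts per block.
    """
    results = []
    for block in blocks:
        for attempt in range(1, max_retries + 1):
            try:
                results.append(translate(block))
                break
            except (TimeoutError, ConnectionError):
                if attempt == max_retries:
                    raise  # give up on this block after the last attempt
                sleep(delay_s)  # wait before resending the query
        sleep(delay_s)  # short delay between blocks
    return results


if __name__ == "__main__":
    # Stand-in "translator" that just upper-cases, for demonstration only.
    sample = "ALMANAC, a calendar or table.\n\nThe word is of Arabic origin."
    print(translate_all(split_blocks(sample), str.upper, delay_s=0.0))
```

Passing the translator in as a function keeps the retry logic independent of any particular service, so the same loop could be pointed at a different service without modification.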

The next step was to establish workable parameters for the Text-PAIR alignment system. The challenge here was to find commonalities between the French translations created by eighteenth-century authors and translators and machine translations produced by a modern automatic translation system. Additionally, the editors and authors of the Encyclopédie were not necessarily constrained to produce an exact translation of the text in question, but could, and did, make significant modifications to the original in terms of length, style, and content. To address this challenge we ran a series of tests with different matching parameters, such as n-gram construction (e.g., the number of words that constitute an n-gram), minimum match lengths, maximum gaps between matches, and decreasing match requirements as a match length increased (what we call a ‘flex gap’), among others, on a representative selection of 100 articles from the Encyclopédie where Chambers was identified as the possible source. It is important to note that even with the best parameters, which we adjusted to get favorable recall and precision results, we were only able to identify 81 of the 100 articles. (See comparison table. The primary parameters chosen were bigrams, stemmer=true, word len=3, maxgap=12, flexmatch=true, minmatchingngrams=5. Consult the TextPAIR documentation and configuration file for a description of these values.) Some articles, even where clearly affiliated, were missed by the aligner, due to the size of the articles (some are very small) and fundamental differences in the translation of the English. For example, the article ‘Compulseur’ is attributed by Mallet to Chambers, but the machine translation of ‘Compulsor’ is a rather more literal and direct translation of the English article than what is offered by Mallet. Further relaxing matching parameters could potentially find this example, but would increase the number of false positives, in effect drowning out the signal with increased noise.
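To give a feel for what n-gram matching with a minimum-match threshold involves, here is a deliberately simplified Python sketch. It is not TextPAIR's implementation: it ignores stemming, gaps, and flexible matching, and it assumes that ‘word len=3’ means discarding words shorter than three letters and that ‘minmatchingngrams=5’ means requiring at least five shared n-grams. The example sentences are invented.

```python
import re


def content_words(text, min_len=3):
    """Lower-case the text and keep alphabetic words of at least min_len letters."""
    return [w for w in re.findall(r"[^\W\d_]+", text.lower()) if len(w) >= min_len]


def ngrams(words, n=2):
    """Return the set of consecutive n-word sequences (bigrams by default)."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(text_a, text_b, n=2, min_len=3):
    """N-grams that occur in both texts after filtering short words."""
    return ngrams(content_words(text_a, min_len), n) & ngrams(content_words(text_b, min_len), n)


def is_match(text_a, text_b, min_matching=5, n=2, min_len=3):
    """Declare a match only if enough n-grams are shared."""
    return len(shared_ngrams(text_a, text_b, n, min_len)) >= min_matching


def recall(found, expected):
    """Share of known source articles that the aligner recovered."""
    return found / expected


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "a quick brown fox jumps over a lazy cat"
    print(sorted(shared_ngrams(a, b)))
    print(is_match(a, b, min_matching=5), is_match(a, b, min_matching=4))
    print(recall(81, 100))
```

The toy pair shares four bigrams, so it passes a threshold of four but fails a threshold of five. That is the trade-off described above in miniature: lowering the threshold recovers short or loosely translated articles at the cost of more false positives.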

All things considered, we were quite happy with the aligner’s performance given the complexity of the comparison task and the multiple potential variations between historical text and modern machine translations. To give an example of how fine-grained and at the same time highly flexible our matching parameters needed to be, see the article ‘Gynaecocracy’ below, which is a fairly direct translation on a rather specialised subject, but which nonetheless matched on only 8 content words (fig. 2).

Figure 2. Comparisons of the article ‘Gynaecocracy’.

Other straightforward articles were, however, missed due to differences in the translation and sparse matching n-grams; see, for example, the small article on ‘Occult’ lines in geometry below, where the 6 matching words weren’t enough to constitute a match for the aligner (fig. 3).

Figure 3. Comparisons of the geometry article ‘Occult’.

Obviously this is a rather inexact science, reliant on an outside process of automatic translation and the ability to match a virtual text that in reality never existed. Nonetheless the 81% recall rate we attained on our sample corpus seemed more than sufficient for this experiment and allowed us to move forward towards a more general evaluation of the entirety of identified matches.

Once settled on the optimal parameters, we then used Text-PAIR to generate both an alignment database, for interactive examination, and a set of static files. Both of these result formats are used for this project. The alignment database contains some 7304 aligned passage pairs. The system allows queries on metadata, such as author and article title, as well as on words or phrases found in the aligned passages. The system also uses faceted browsing to allow the user to summarize results by the various metadata (for more on this, see Note below). Each aligned passage is presented as a facing page representation and the user can toggle a display of all of the variations between the two aligned passages. As seen below, the variations between the texts can be extensive (fig. 4).

Figure 4. Text-PAIR interface showing differences in the article ‘Air’.

Text-PAIR also contextualises results back to the original document(s). For example, the following is the article ‘Almanach’ by D’Alembert, showing the aligned passage from Chambers in blue (fig. 5).

Figure 5. Article ‘Almanach’ with shared Chambers passages in blue.

In this instance, D’Alembert reused almost all of Chambers’ original article ‘Almanac’, with some minor variations, but does not appear to have indicated the source of the first part of his article (page image).

The alignment database is a useful first pass to examine the results of the alignment process, but it is limited in at least two ways. It identifies each aligned passage, but does not merge multiple passages identified in article pairs. Thus we find 5 shared passages between the articles ‘Constellation’. The interface also does not attempt to evaluate the alignments or identify passages that occur between different articles. For example, D’Alembert’s article ‘ATMOSPHERE’ indeed has a passage from Chambers’ article ‘Atmosphere’, but also many longer passages from the article ‘Generation’.

To accumulate results and to refine evaluation, we subsequently processed the raw Text-PAIR alignment data as found in the static output files. We developed an evaluation algorithm for each alignment, with parameters based on the length of the matching passages and the degree to which the headwords were close matches. This simple evaluation model eliminated a significant number of false positives, which we found were typically short text matches between articles with different headwords. The output of this algorithm resulted in two tables, one for matches that were likely to be valid and one that was less likely to be valid, based on our simple heuristics – see a selection of the ‘YES’ table below (fig. 6). We are, of course, making this distinction based on the comparison of the machine translated Chambers headwords and the headwords found in the Encyclopédie, so we expected that some valid matches would be identified as invalid.

Figure 6. Table of possible article borrowings.
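The kind of heuristic described above can be sketched in a few lines of Python. This is an illustration only, not the evaluation algorithm we ran: the field names, the thresholds, and the use of the standard library’s `difflib` to score headword similarity are all assumptions made for the example.

```python
from difflib import SequenceMatcher

def headword_similarity(a, b):
    """Similarity between two headwords, from 0.0 (unrelated) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def classify(alignment, min_words=30, min_headword_sim=0.8):
    """Sort an alignment into the 'YES' or 'NO' table.

    Assumed rule: a match is kept if the headwords are close or the shared
    passage is long; short matches between different headwords are rejected.
    Both thresholds are hypothetical.
    """
    similarity = headword_similarity(
        alignment["source_headword"], alignment["target_headword"]
    )
    long_enough = alignment["passage_words"] >= min_words
    return "YES" if similarity >= min_headword_sim or long_enough else "NO"
```

A short passage shared by ‘Almanach’ and ‘Almanac’ would land in the ‘YES’ table under these settings, while an equally short passage shared by articles with unrelated headwords would not, which is exactly the class of valid matches that the human evaluation then has to recover.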

The next phase of the project included the necessary step of human evaluation of the identified matches. While we were able to reduce the work involved significantly by generating a list of reasonably solid matches to be inspected, there is still no way to eliminate fully the ‘arduous toil’ of comparison referenced by Lough. More than 5000 potential matches were scrutinised, looking in essence for ‘false negatives’, i.e., matches that our evaluation algorithm classed as negative (based primarily on differences in headword translations) but that were in reality valid. The results of this work were then merged into a single table of what we consider to be valid matches, a list that includes some 3700 Encyclopédie articles with at least one matching passage from the Cyclopaedia. These results will form the basis of a longer article that is currently in preparation.


In all, we found some 3778 articles in the Encyclopédie that upon evaluation seem highly similar in both content and structure to articles in the 1741 edition of Chambers’ Cyclopaedia. Whether or not these articles constitute real acts of historical translation is the subject for another, or several other, articles. There are simply too many outside factors at play, even in this rather straightforward comparison, to make blanket conclusions about the editorial practices of the encyclopédistes based on this limited experiment. What we can say, however, is that of the 1081 articles that include a ‘Chambers’ reference in the Encyclopédie, we only found 689 with at least one matching passage. Obviously this recall rate of 63.7% is well below the 81% we attained on our sample corpus, probably due to overfitting the matching algorithm to the sample, which warrants further investigation. But beyond testing this ground truth, we are also left with the rather astounding fact of 3089 articles with no reference to Chambers whatsoever, all of which seem, at first blush, to be at least somewhat related to their English predecessors.
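For readers who want to check the arithmetic behind these figures, the short Python snippet below recomputes the two recall rates and the number of unattributed articles from the counts given above. The variable names are ours; the numbers are those reported in the text.

```python
# Counts reported in the text.
sample_found, sample_total = 81, 100    # test sample of attributed articles
cited_found, cited_total = 689, 1081    # articles with a 'Chambers' reference
all_matched = 3778                      # all articles judged similar to Chambers

sample_recall = sample_found / sample_total
cited_recall = cited_found / cited_total
unattributed = all_matched - cited_found

print(f"sample recall: {sample_recall:.1%}")
print(f"recall on cited articles: {cited_recall:.1%}")
print(f"matched articles with no Chambers reference: {unattributed}")
```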

The overall evaluation of these results remains ongoing, and the ‘arduous toil’ of traditional textual comparison continues apace, albeit guided somewhat by the machine’s heavy hand. Indeed, the use of machine translation as a bridge between documents to find similar passages, be they reuses, plagiarisms, etc., is, as we have attempted to show here, a workable approach for future research, although not without certain limitations. The Chambers–Encyclopédie task outlined above is fairly well constrained and historically bounded. More general applications of these same methods may well yield less useful results. These reservations notwithstanding, the fact that we were able to unearth many thousands of valid potential intertextual relationships between documents in different languages is a feat that even a few years ago might not have been possible. As large-scale language models become ever more sophisticated and historically aware, the dream of intertextual bridges between multilingual corpora may yet become a reality. (For more on ‘intertextual bridges’ in French, see our current NEH project.)


The question of the Dictionnaire de Trévoux is one such factor, as it is known that both Chambers and the encyclopédistes used it as a source for their own articles – so matches we find between the Chambers and Encyclopédie may indeed represent shared borrowings from the Trévoux and not a translation at all. Or, more interestingly, perhaps Chambers translated a Trévoux article from French to English, which a dutiful encyclopédiste then translated back to French for the Encyclopédie – in this case, which article is the ‘source’ and which the ‘translation’? For more on these particular aspects of dictionary-making, see our previous article ‘Plundering philosophers: identifying sources of the Encyclopédie’, Journal of the Association for History and Computing 13.1 (Spring 2010) and Marie Leca-Tsiomis’ response, ‘The use and abuse of the digital humanities in the history of ideas: how to study the Encyclopédie’, History of European ideas 39.4 (2013), p.467-76.

– Glenn Roe and Mark Olsen

Annotation in scholarly editions and research

It has been, alas, almost exactly a year since our last face-to-face Besterman Workshop at 99 Banbury Road. Of course, webinars allow more people to join, and to do so, most importantly, from the comfort of their homes, where they can sit comfortably and set their thermostats to the temperature that suits them best. The advent of the Zoom/Teams era, however, has brought with it a number of unfortunate consequences: discussions are not as lively as they used to be, asking a follow-up question is nearly impossible, and so are chats with friends and colleagues, before, during, or after the talk. Worst of all, we no longer get a chance to eat our beloved Leibniz or Belgian biscuits – but those, to be fair, had already become something of a rarity towards the beginning of 2018. Anyway: those of you who did attend our last face-to-face Besterman Workshops may remember this gloomy and cumbersome poster of mine hanging from the mantelpiece.

This poster was presented at a conference in Wuppertal, Germany, at the end of February 2019: ‘Annotation in Scholarly Editions and Research: Function – Differentiation – Systematization’. Organised by Julia Nantke (Universität Hamburg) and Frederik Schlupkothen (Bergische Universität Wuppertal), this two-day bilingual Anglo-German colloquium was a wonderful occasion to reflect on the age-old human habit of glossing, commenting, and generally interfering with other people’s work.

Alongside some theoretical papers (to mention but one, Willard McCarty’s brilliant keynote lecture on annotation as a knowledge-producing practice), the symposium featured several more practice-oriented talks that would have certainly been of interest to many of our Digital Humanities followers: some focused on how best to structure and visualise annotation in digital scholarly editions; others raised the question as to how to annotate audio-visual materials; and yet others investigated the extent to which annotation can be automated.

Some of the papers given at the ‘Annotation in Scholarly Editions and Research’ conference can now be read in a volume published last year (yes, in 2020!) by De Gruyter and available in print as well as an Open Access eBook.

My own contribution to the volume (which you can find here, should you want to read it) presents what I think might be an efficient and user-friendly three-level annotation system, the ‘reversible annotation system’, which I developed while working on Digital d’Holbach, a born-digital scholarly edition of Paul-Henri Thiry d’Holbach’s complete works. On this model, I argue, a single set of notes can be so structured as to cater to very different audiences, meaning that the edition can hope simultaneously to be user-friendly and cost-efficient. Should you have any comments or suggestions for improvement, please do not hesitate to let me know!

Ruggero Sciuto, University of Oxford

Digitising the margins: a classification of Voltaire’s scribbles

The most famous squiggly lines relating to eighteenth-century writing are almost certainly to be found in Tristram Shandy. Sterne uses them to illustrate the non-linearity of stories (see about halfway down that page) and digressions from the main narrative, before reviving the device several volumes later to render graphically for his readers the movement of the stick brandished by the character Trim. But these squiggles from 1761-1762 are far from alone. Both before and after Sterne’s foray into wiggly line design, Voltaire was peppering the margins of his books with marginalia, which involved both verbal and non-verbal elements – that is, words and squiggles.

When a team of Russian scholars began to publish the marginalia from his library in the 1970s in the Corpus des notes marginales de Voltaire, they decided that a facsimile edition would be both too expensive and not sufficiently clear to read. They settled on a compromise editorial policy, which entailed transcribing Voltaire’s words and reproducing graphically any accompanying marks and lines (usually made in ink or lead pencil, but also comprising scratches or indentations in the paper, for example crosses scored with the thumb-nail). When the edition passed to the Voltaire Foundation, we adhered to the same principles for the remaining volumes, much to the chagrin of our typesetter, who nevertheless heroically drew hundreds of scribbles electronically to incorporate into the typeset file.

Vauvenargues, p.90; OCV, vol.145, p.484.

The example above and those that follow are from books that Voltaire annotated with the intent of returning them to their authors with suggestions for improvement. In principle this should mean a greater likelihood that any shapes drawn should be intelligible and contribute to the meaning of the verbal marginalia. Indeed, in the first case, in a copy of Vauvenargues’s Introduction à la connaissance de l’esprit humain, we can see that the vertical wavy line in the margin brackets the passage generally, and is connected with the note ‘peu déve / lop[p]é’ (poorly developed), while the second + sign links ‘sage’ in the printed text to ‘fort’ in the margin, indicating that rather than referring to a wise person, the author should be talking about a strong person (in opposition to the weak person indicated by the first + sign higher up).

Vauvenargues, p.48; OCV, vol.145, p.477.

Here Voltaire uses + signs again to flag the word ‘dans’ twice at the top of the page, and indicates by the curved line and a further ‘dans’ in the margin that Vauvenargues should be consistent in beginning each in the series of adverbial clauses with the same preposition. At the beginning of the new section lower down, he uses a sort of Greek gamma in the margin to show that an insertion should be made. All very clear for the addressee of the annotations. And between those two? The squiggly line in the margin is hard to interpret and may simply bear testament to his reading: did he stumble on this passage? Did he dislike it? Perhaps he wanted to write a criticism or a suggestion but couldn’t decide on what to say. At any rate, the squiggle draws our eye, nearly 300 years after it was penned, to a passage to which Voltaire must also have paid particular attention.

Frederick, p.122; OCV, vol.145, p.156.

This final example is a bit different insofar as it is not actually in Voltaire’s hand, but is a careful copy made of an original that was subsequently destroyed in the bombing of Berlin during the Second World War. Slanted crosses, several with double verticals (reminiscent of the letter H), indicate lines of verse by Frederick, king of Prussia, with which Voltaire, preeminent poet of his day, was unhappy and which are commented in the margins. The ‘gamma’ again probably draws the king’s attention to the replacement word written over the line. Here, the limits of the typeset page become apparent as the slashing lines and crosses come so thick and fast that it becomes difficult to fit them all in. An apparatus of notes at the bottom of the page helps, but the effect at first glance is really not quite the same.

Digitising these volumes, as part of the Voltaire Foundation’s new initiative Digital Enlightenment, poses new challenges, but can it also bring new solutions? On first analysis the infinitely flexible nature of Voltaire’s squiggles seems to be at odds with the ordered discipline inherent in our approach to digitising the Œuvres complètes. We soon decided that we were not going to scan every mark in the source volume and virtually paste it into the digital text – not only would madness likely that way lie, but also considerable expense, and it would be a distinctly inelegant way of solving the problem. The more you look at the corpus of squiggles, however, the more you see that although in strict terms you have a very large number of different marks, you have a much smaller number of different types of mark, and if we can successfully classify and label those types, we can use that classification and those labels when we digitise the content. Instead of the data saying ‘here’s a picture of a squiggle’, it will instead say ‘at this point there’s a mark of type X.’

How, then, to classify these marks? If you think of what makes up a mark or a squiggle, it will be one or more line-type marks, and where there is more than one line-type mark, they may meet or cross each other at a particular point. We call the line-type marks edges, and the points where they meet or cross nodes, and if you count the number of edges and nodes you find you have a ready-made way of classifying – and even sorting – your squiggles. For example:


has one edge, and no nodes:


has two edges, but still no nodes, and:

has one edge and one node. If we turn these counts into parts of a label (e.g. n0e1) we can start to distil order out of infinite variety, and we can pretty soon have an easy lookup for our digitisers to use:

There is, of course, a degree of discretion involved here in grouping marks according to type – there is a slanted line 10º from the vertical and another 10º from the horizontal, but what if we find a line precisely 45º from both? Or a vertical line that wiggles not once or twice but… seven times? Well, we may then need to add a shape and a code, but the method allows that, and if there’s one thing this digitisation exercise has taught us, it’s that until you’ve marked up the final full stop, novelty may at any time appear before you. Expect, and accommodate, the unexpected.
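As a rough illustration of how such labels behave, the following Python sketch builds a label from node and edge counts and sorts mark types by those counts. Only the `n0e1` pattern comes from the scheme described above; the class name, the optional `variant` suffix for telling apart shapes with the same counts, and the sort order are assumptions made for the example, not the coding system actually used in the digitisation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MarkType:
    nodes: int          # points where line-type marks meet or cross
    edges: int          # line-type marks
    variant: str = ""   # hypothetical suffix for shapes sharing the same counts

    @property
    def label(self):
        """Label in the style of 'n0e1': node count, then edge count."""
        return f"n{self.nodes}e{self.edges}{self.variant}"

def sort_marks(marks):
    """Order mark types by node count, then edge count, then variant."""
    return sorted(marks, key=lambda m: (m.nodes, m.edges, m.variant))

# The three cases described in the text.
single_line = MarkType(nodes=0, edges=1)
two_lines = MarkType(nodes=0, edges=2)
line_with_node = MarkType(nodes=1, edges=1)

print([m.label for m in sort_marks([line_with_node, two_lines, single_line])])
```

Because a new shape only needs a new combination of counts (or a new variant suffix), the unexpected can be accommodated without renaming any mark that has already been encoded.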

Using this method, we will be able to allow readers to search for particular marks. Or, more correctly, for particular classifications of marks, e.g. for ‘a straight line slanting from bottom left to top right at an angle of inclination less than 45º from the horizontal’ rather than for a specific slanting line. But the classification should be sufficiently specific that a reader encountering a mark in one text, and wondering where else Voltaire has used it, should be able to see the other relevant instances.

How will we deal with squiggles that defy classification? We defy squiggles to defy this classification! Time will, of course, tell, but we’re confident that we can accommodate anything that Voltaire felt necessary to add to the texts he was reading, blissfully unaware of the coding system that awaited his scribbles.

– Gillian Pink and Dan Barker, dancan Ltd


Exploring Voltaire’s letters: between close and distant readings

‘La lettre au fil du temps: philosophe.’

A stamp produced by the French post office in 1998 celebrates the art of letter-writing by depicting Voltaire writing letters with both hands. It’s true that Voltaire wrote a lot of letters – over 15,000 are known, and more turn up all the time – but even so it’s not altogether clear that an ambidextrous letter-writer is someone we entirely want to trust. Voltaire’s correspondence is full of difficulties and traps, and faced by such a huge corpus, it is hard to know where to start. Without question, the Besterman ‘definitive’ edition (1968-77), digitised in Electronic Enlightenment, has had a major impact on Enlightenment scholarship: historians and literary critics make frequent use of these letters, but usually in an instrumental way, adducing a single passage in a letter as evidence in support of a date or an interpretation.

Nicholas Cronk and Glenn Roe, Voltaire’s correspondence: digital readings (CUP, 2020).

Voltaire’s letters can be notoriously ‘unreliable’, however, and they really need to be read and interpreted – like all his texts – as literary performances. Few critics have attempted to examine the corpus of the correspondence in its entirety and to understand it as a literary whole. In our new book, Voltaire’s correspondence: digital readings, we have experimented with a range of digital humanities methods, to explore to what extent they might help us identify new interpretative approaches to this extraordinary correspondence. The size of the corpus seems intimidating to the critic, but it is precisely this that makes these texts a perfect test-case for digital experimentation: we can ask questions that we would simply not have been able to ask before.

For example, we looked at the way Voltaire signs off his letters – and were surprised to find that only 13% of the letters are actually signed ‘Voltaire’; while over a third of the letters are signed with a single letter, ‘V’. Then Voltaire is hugely inventive in the way he plays with the rules of epistolary rhetoric, posing as a marmot to the duc de Choiseul. And if you want to know why in a letter (D18683) to D’Alembert he signs off ‘Miaou’, the answer is to be found in a fable by La Fontaine…

We studied Voltaire as a neologist. Critics have usually described Voltaire as an arch-classicist adhering rigorously to the norms of seventeenth-century French classicism. True, yet at the same time he is hugely energetic in coining new words, an aspect of his literary style that has been insufficiently studied. Here, corpus analysis tools, coupled with available lexicographical digital resources, allow us to consider Voltaire’s aesthetic of lexical innovation. In so doing, we can test the hypothesis that Voltaire uses the correspondence as a laboratory in which he can experiment with new formulations, ideas, and words – some of which then pass into his other works. We identified 30 words first coined by Voltaire in his letters, and another 36 words first used in his other works, many of which are then reused in the correspondence. Emmanuel Macron has encouraged the description of himself as a ‘président jupitérien’, so it’s good to discover that ‘jupitérien’ is one of the words first coined by Voltaire.

A letter in Voltaire’s hand, sent from the city of Colmar to François Louis Defresnay (D5612, dated 1753/1754).

A reader of Voltaire’s letters cannot fail to be struck by the frequency of his literary quotations. We explore this phenomenon through the use of sequence alignment algorithms – similar to those used in bioinformatics to sequence genetic data – to identify similar or shared passages. Using the ARTFL-Frantext database of French literature as a comparison dataset, we attempt a detailed quantification and description of French literary quotations contained in Voltaire’s correspondence. These citations, taken together, give us a more comprehensive understanding of Voltaire’s literary culture, and provide invaluable insights into his rhetoric of intertextuality. No surprise that he quotes most often the authors of ‘le siècle de Louis XIV’, though it was a surprise to find that Les Plaideurs is the Racine play most frequently cited. And who expected to find two quotations from poems by Fontenelle (neither of them identified in the Besterman edition)?! Quotations in Latin also abound in Voltaire’s letters, many of these drawn, predictably enough, from the famous poets he would have memorised at school, Horace, Virgil, and Ovid – but we also identified quotations, hitherto unidentified, from lesser poets, such as a passage from Manilius’ Astronomica. By examining as a group the correspondents who receive Latin quotations, and assigning to them social and intellectual categories established by colleagues working at Stanford, we were able to establish clear networks of Latin usage throughout the correspondence, and confirm a hunch about the gendered aspect of quotation in Latin: Voltaire uses Latin only to his élite correspondents, and even then, with notably rare exceptions such as Emilie Du Châtelet, only to men.

The woman on the left, a trainee pilot in the Brazilian air force, is an unwitting beneficiary of Voltaire’s bravura use of Latin quotation. The motto of the Air Force Academy is a stirring (if slightly macho) Latin quotation: ‘Macte animo, generose puer, sic itur ad astra’ (Congratulations, noble boy, this is the way to the stars). The quotation is one that Voltaire uses repeatedly in some dozen letters, and it is found later, for example in Chateaubriand’s Mémoires d’outre-tombe. On closer investigation it turns out that this piece of Latin is an amalgam of quotations from Virgil and Statius – in effect, a piece of pure Voltairean invention.
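The principle behind this kind of detection can be illustrated with a toy example in Python. The sketch below uses the standard library’s `difflib` as a stand-in for the sequence alignment tools used in the book, and simply finds the longest run of words shared by the motto and a line of Virgil (Aeneid IX.641, supplied here for the illustration). The function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

def words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[^\W\d_]+", text.lower())

def longest_shared_run(a, b):
    """Longest sequence of consecutive words that occurs in both texts."""
    wa, wb = words(a), words(b)
    matcher = SequenceMatcher(None, wa, wb, autojunk=False)
    match = matcher.find_longest_match(0, len(wa), 0, len(wb))
    return wa[match.a:match.a + match.size]

motto = "Macte animo, generose puer, sic itur ad astra"
virgil = "macte nova virtute, puer, sic itur ad astra"  # Aeneid IX.641

print(longest_shared_run(motto, virgil))
```

The shared run covers the end of the motto but not its middle, which is the sort of signal that prompts a closer look at where the remaining words come from.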

In the end, Voltaire’s correspondence is undoubtedly one of his greatest literary masterpieces – but it is arguably one that only becomes fully legible through the use of digital resources and methods. Our intention with this book was to affirm the simple postulate that digital collections – whether comprised of letters, literary works, or historical documents – can, and should, enable multiple reading strategies and interpretative points of entry; both close and distant readings. As such, digital resources should continue to offer inroads to traditional critical practices while at the same time opening up new, unexplored avenues that take full advantage of the affordances of the digital. Not only can digital humanities methods help us ask traditional literary-critical questions in new ways – benefitting from economies of both scale and speed – but, as we show in the book, they can also generate new research questions from historical content; providing interpretive frameworks that would have been impossible in a pre-digital world.

The size and complexity of Voltaire’s correspondence make it an almost ideal corpus for testing the two dominant modes of (digital) literary analysis: on the one hand, ‘distant’ approaches to the corpus as a whole and its relationship to a larger literary culture; on the other, fine-grained analyses of individual letters and passages that serve to contextualise the particular in terms of the general, and vice versa. The core question at the heart of the book is thus one that remains largely untreated in the wider world: how can we use digital ‘reading’ methods – both close and distant – to explore and better understand a literary object as complex and multifaceted as Voltaire’s correspondence?

– Nicholas Cronk & Glenn Roe, Co-directors of the Voltaire Lab at the VF

Voltaire’s correspondence: digital readings will be published in print and online at the end of October. The online version is available free of charge for two weeks to personal and institutional subscribers.

From the mundane to the philosophical: topic-modelling Voltaire and Rousseau’s correspondence

Voltaire and Rousseau’s correspondences are two fascinating collections which have perhaps not received the attention they could have, due to the nature of these texts. Written over five decades, these letters cover a wide range of topics, from the mundanity of everyday concerns to more elaborate subjects. Getting an overall picture of these correspondences is challenging for the simple reader. This is unfortunate since these correspondences not only constitute a window into the private lives of Voltaire and Rousseau, or show an unfiltered expression of their respective thoughts, but they are also an example of the eclecticism professed by the philosophes. Fortunately modern computational techniques can truly help in providing an overview of the content of these letters and hopefully recapture – in a somewhat organized fashion – this very eclecticism of the Lumières. Thanks to the collaboration between the Voltaire Foundation and the ARTFL Project, I will be briefly discussing how topic-modeling can be used to draw an overall picture of these correspondences, and show a couple of examples of the model built from the Voltaire letters.

The ARTFL Project has long been engaged in exploring 18th-century discourses using digital tools, and the thematic opacity of correspondences is an ideal use-case for topic-modelling. This particular algorithm was designed to generate clusters of closely related words (or topics) by analyzing all word co-occurrences in any given corpus. Because these topics are extracted from their source texts, they are understood to describe the contents of the corpus analyzed. We recently released a topic-modelling browser – called TopoLogic – which was designed to explore such clusters of co-occurring words, and ran a preliminary experiment against the French Revolutionary Collection, the results of which can be seen here. When we built the topic models for Voltaire and Rousseau’s correspondences, we made sure to use the same parameters for both collections such that 40 topics (or discourses) were generated from each set of letters. We also only used those letters written by Voltaire on one side, and Rousseau on the other, hoping that we could perhaps make some comparisons between both models.

Let’s start with the Voltaire model, from which you can see the first 20 topics below:

As a first view into the topic model, the browser gives us the top 10 words for each topic, as well as their overall prevalence in the letters by Voltaire. From there we can further explore any topic, such as 16, which seems to map to Voltaire’s idea of the philosophe fighting against religious intolerance. By clicking on the topic however, we get an overview of how the topic is distributed in time, most important words in the topic, correlated topics, as well as documents where the topic is prominent (see figure below).

Let’s focus on several sections of this overview. We note below that the terms of philosophe and philosophie are weighted far more heavily than any other term, suggesting perhaps that all other words in this cluster may just constitute different characteristics of the philosophe in Voltaire’s eyes: religious concerns (prêtre, jésuite, religion, tolérance), attributes (honnête, sage), means of expression (article, livre).

All of these observations can of course be verified by exploring letters that feature topic 16 in a prominent way, which the browser does list. We can also see how the philosophe discourse evolves over the more than sixty years of Voltaire’s letters. Unsurprisingly, as his public involvement in religious affairs increases, the prevalence of such terms discussing his idea of the philosophe rises as well in his letters.

Among the discourses which tend to follow the same trend over time (see figure below), the cluster of terms related to justice (topic 5) stands out, once again showing that his public involvement is mirrored in his private correspondence. While these aspects are nothing really new, they provide for the prospective reader an easy way to find those letters that do discuss these topics.

Another interesting aspect of topic-modeling is that we can also examine the discursive make-up of any of Voltaire’s letters, and see if there are any other letters that share the same themes. Let’s examine Voltaire’s famous letter to Rousseau in which he mocks the citoyen de Genève’s position on the impact of literature in the second discourse (see figure below): ‘Les Lettres nourissent l’âme, la rectifient, la consolent’.

When we look at the topical representation of this letter in the browser, we can note that the model found a number of different topics within this letter, which when combined do provide an overview of its contents. In it, Voltaire discusses – with much irony – his own experience as a writer (topic 33), which includes his role as historiographe du roi (topic 36), as well as the many controversies he was involved in (topic 10). He sarcastically laments the fact that he cannot afford to live with savages in a distant land (topic 25) because his health requires him to be treated by a doctor (topics 26 and 35). And as a whole, he defends the role of literature as a positive good for man (topic 0). Of course, one could argue that this topical structure is approximate, prone to discussion, and this is certainly true. However, this approximation is now available for all 15,000 letters, which then allows the computer to compare and group letters by this very topical structure. In this same document view, we can see documents which share a similar mixture of topics, such as a letter to Ivan Shuvalov from 1757 where Voltaire discusses his writing of history while displaying a very keen concern for the perception and impact of his writing, or another to D’Alembert where he complains about his bad health while stressing the importance of writing about useful things (‘il y avait cent choses utiles à dire qu’on n’a point dittes encore’).
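To give a sense of how letters can be compared by topical structure, here is a minimal Python sketch that ranks letters by the cosine similarity of their topic distributions. The similarity measure, the letter names, and the four-topic weights are all invented for the illustration; the text above does not specify which measure the browser uses.

```python
import math

def cosine(p, q):
    """Cosine similarity between two topic-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def most_similar(target, letters, k=2):
    """Names of the k letters whose topic mixture is closest to the target's."""
    scored = [
        (cosine(letters[target], weights), name)
        for name, weights in letters.items()
        if name != target
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

# Hypothetical letters with hypothetical weights over four topics.
letters = {
    "letter_a": [0.7, 0.2, 0.1, 0.0],
    "letter_b": [0.6, 0.3, 0.1, 0.0],
    "letter_c": [0.0, 0.1, 0.1, 0.8],
}

print(most_similar("letter_a", letters, k=1))
```

Since each letter is reduced to a short vector of topic weights, this comparison is cheap enough to run across an entire correspondence, however approximate the underlying topics may be.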

One last use of the topic model is to examine individual words and the different contexts in which they are used. If we look at the uses of écrivain in the correspondences (see figure below), we can see that its uses span different types of discourses related to reason, the writing of history, or the public role of the writer. Looking at the actual word associations, we also note potentially interesting patterns. In the case of words that share similar topic distributions (used with a similar mix of discourses), a group of terms related to ignorance seems to dominate: fausseté, mensonge, ignorance, vérité, erreur, fable… This may allude to a sense of mission in Voltaire’s writings: to correct inaccuracies, to dispel lies, to reestablish the truth in the face of ignorance. Looking this time at words that tend to co-occur with écrivain, we get a very different picture, with terms that relate more to the activity of writing and the product of that writing. These two views on word associations do not contradict one another, but suggest different ways of thinking of the role of the écrivain as depicted in Voltaire’s letters.
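The second of these two views, co-occurrence, can be sketched in a few lines. The example below assumes each letter has already been reduced to a list of word tokens and simply counts which other words appear in the same letters as the target word. The browser's own co-occurrence window and weighting are not specified here, and the toy documents are invented.

```python
from collections import Counter

def cooccurring_words(target, documents, top_n=3):
    """Count the words that appear in the same documents as the target word."""
    counts = Counter()
    for tokens in documents:
        if target in tokens:
            # Count each co-occurring word once per document.
            counts.update(set(tokens) - {target})
    return counts.most_common(top_n)

# Hypothetical, heavily simplified tokenised letters.
docs = [
    ["écrivain", "ouvrage", "style"],
    ["écrivain", "ouvrage", "public"],
    ["médecin", "santé"],
]

print(cooccurring_words("écrivain", docs, top_n=1))
```

The topic-distribution view differs in that it compares words by the mix of topics they are assigned to, so two words can be close without ever appearing in the same letter.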

To finish, let’s take a look at the topic model of Rousseau’s correspondence, and in particular how we can relate it to that of Voltaire. A quick overview of the first 20 topics in Rousseau’s letters reveals a similar – yet distinct – picture of the topical composition of his correspondence (see figure below).

Using the browser, we could track down Rousseau’s response to Voltaire’s criticism of the second discourse, and see if other letters discuss similar themes. This is all within the scope of this browser. For the sake of brevity, however, and to show how topic models can be used to run comparative experiments, we wanted to focus on Rousseau’s usage of the word écrivain in order to see if and how it differed from what was suggested in the Voltaire model. As we can see below, Rousseau tends to use the term in similar contexts: the écrivain is invoked first and foremost as a conveyor of truth. But looking more closely at word associations, a distinctive pattern does emerge: such terms as lâche, haine, hypocrite, acharnement, or jalousie highlight a well-known trait of Rousseau, his paranoia in the face of his success as a writer. Clicking on any of these words in the browser would allow a researcher to track down the individual uses of these terms as they relate to écrivain, and find those letters that discuss his persecution complex.

To conclude, we are well aware that any analysis provided here is built purely on the patterns derived from the topic models, and as such remains unproven until verified by a close reading of the letters themselves. However, we hope to have shown how using a tool such as topic modeling can potentially provide new insights into the correspondences of Voltaire and Rousseau, or at the very least offer better guidance to scholars working on these two incredibly rich collections.

Clovis Gladstone

This article was first published in the Café Lumières blog in June 2020.

Clovis Gladstone’s Rousseau et le matérialisme appeared in Oxford University Studies in the Enlightenment 2020:8.


Digitising Candide


Candide, title page of edition 299L (see OCV, vol.48, p.88).

In what is arguably his most widely known work, Voltaire describes the extraordinary journey that his eponymous hero undertakes through geography and understanding, and for us digitising the novel is the first step on the long and – we hope and trust – exciting journey to digitise the whole of the complete works, the OCV. As such it has been a proof of concept, a baptism of reassuringly gentle fire, and a taste of things to come.

For a digital file that’s worth its bytes we need much more than just electronic words. We need a format that will encode structure and meaning so that people and – just as importantly – programs can understand the extra information we’re embedding into the file, and use it to help make readers’ and scholars’ use of the material richer and easier.

Thankfully many others have trodden a similar path. Since the 1980s countless digital humanities minds have contributed to the Text Encoding Initiative, simultaneously a sophisticated tag set for marking up scholarly material, and a community engaged in maintaining that model, supporting the people who use it, and improving it based on collective experience, wisdom, and usage. We had no need to invent a wheel – TEI is beautifully adapted for our journey. We used it to design a tailored model to suit the particular needs of the OCV and Digital d’Holbach. This is being applied for us by our supplier, Apex CoVantage, who are assembling a specialist team and developing automated tools to streamline the workflow, and using the first dozen volumes as tools to train both people and software. Candide was their introduction to this fascinating marriage of the Enlightenment and the computer.

The structural tagging – for things like introductions and notes – will allow readers to see as much or as little detail and complexity as they wish, ranging from just the edited version of Voltaire’s words at one end of the scale to the full panoply of editorial introduction, notes, bibliographic citations, and textual variants at the other, with a range of choices between the two extremes. It will also help readers navigate through and across the various parts of the volume, enabling their own particular journey.

Tagging for meaning – what we call the semantic tagging – is what will allow the dataset to communicate within itself, to other datasets, and also to humans. It’s what can help make search fully useful rather than just a literal echo of what a user types, and it can help a reader see a wider range of ‘next steps’ by making meaningful connections beyond those possible with just words and spaces. We tag people, places, dates, works, and institutions, and we’re also going to be developing a full set of metadata to accompany the datasets, as a rich and consistent layer describing the entire corpus in disciplined detail – we aim for this to be our contribution to the semantic web. We tag for primary and secondary content, and every piece of text has a language code associated with it so that if machine translation were applied to the dataset we could choose which parts of an edition are translated (e.g. the introduction) and which are left in the original language (e.g. primary content quotes). Again, our work enables control and choice.
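As a rough illustration of how a language code lets a program pick out parts of an edition, the sketch below parses a small TEI-like fragment and returns the text of every element carrying a given `xml:lang` value. The fragment, the `type` values, and the function name are assumptions made for this example and do not reproduce the project's actual model.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical fragment: an English introduction and a French primary text.
SAMPLE = """<text>
  <div type="introduction" xml:lang="en"><p>Editorial introduction.</p></div>
  <div type="primary" xml:lang="fr"><p>Il faut cultiver notre jardin.</p></div>
</text>"""

def passages_in_language(xml_text, lang):
    """Return the text of every element explicitly tagged with the given language."""
    root = ET.fromstring(xml_text)
    return [
        "".join(element.itertext()).strip()
        for element in root.iter()
        if element.get(XML_LANG) == lang
    ]

print(passages_in_language(SAMPLE, "en"))
print(passages_in_language(SAMPLE, "fr"))
```

A translation step could then be pointed only at the passages returned for one language, leaving the rest untouched.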

Part of the digital file of Candide, showing the end of the text of the novel.


The end of the novel in the Paris, Lambert, 1759 edition.

These two aspects turn a dataset into something akin to a machine (with the metadata as the auxiliary power unit), with multiple interlocking components that make it much easier for readers to summon or suppress the parts of the edition they need.

A machine needs precision in its gears and smoothness in its moving parts, and digitisation is revealing the odd snag and missing bolt where the tools we now have to analyse the workings were not available forty years ago. The exercise is therefore an opportunity to collate points we might wish to address in a revised edition (as well as revealing the occasional typographic error). But overall it’s gratifying how the abstract model we designed ahead of any full-scale digitisation has proved to be fit for purpose, and allows us to interrogate and improve the digital Candide by program, benefits which will increase exponentially as more volumes are added to the electronic corpus. The whole, we think, will be very much greater than the sum of its parts.

While the ultimate consumer of the digital files we’re creating will be human readers, the immediate consumers as intermediaries will be machines and processes, and even a cursory look at the ‘raw’ file of Candide shows you why. Character-for-character there is much more tagging than text, and for the eye simply to read the novel is near impossible; we keep tripping over indexing, line breaks, page breaks, emphasis, witness references … the list of tags is seemingly endless. What we see is ‘noise’ since we’re not programmed to filter one thing from another, but a program can be told to do exactly that, allowing any amount of filtering, cross-referencing, formatting, and even transformation to render the volume exactly as a reader requires. In order to ensure simplicity, but allow richness, and to enable choice, we have to make sure we start from complexity.
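The kind of filtering a program can do is easy to sketch. The example below takes a short TEI-like line containing line-break, page-break, emphasis, and note tags, and renders only the readable text, dropping editorial notes. The fragment is invented for illustration (including the page number), uses no namespace for simplicity, and is not taken from the actual Candide file.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw line: lb = line break, pb = page break, hi = emphasis.
RAW = (
    '<p>Cela est bien dit, <lb/>répondit <hi rend="italic">Candide</hi>,'
    '<note>editorial note</note> mais il faut cultiver notre jardin.<pb n="237"/></p>'
)

def readable_text(element, skip=("note",)):
    """Collect the text of an element, ignoring tags and skipping unwanted elements."""
    if element.tag in skip:
        return ""
    parts = [element.text or ""]
    for child in element:
        parts.append(readable_text(child, skip))
        # Text that follows a child belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)

plain = readable_text(ET.fromstring(RAW))
print(plain)
```

Passing an empty `skip` tuple would keep the note text in place, which is one way the same file could serve readers who want the editorial apparatus as well as those who do not.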

Digitisation and the accompanying process of metadata curation are all about preserving content, extending reach, and adding value. If we get this right, we should be laying the foundations for globally accessible tools of immense richness which will add to – and not detract from – the core material and scholarship on which it is all built. We have a responsibility to use the digital tools available to help as many people as possible find, read, and understand the extraordinary legacy of Voltaire and his contemporaries. Il faut cultiver nos données.

Dan Barker, dancan Ltd.

The Digitizing Enlightenment ‘twitterstorm’ of 3 August 2020

This past week our publication partner, Liverpool University Press, shipped out copies of Digitizing Enlightenment: digital humanities and the transformation of eighteenth-century studies, edited by Simon Burrows and Glenn Roe, the July volume of Oxford University Studies in the Enlightenment.

Rousseau’s Premier Discours

Frontispiece and title page of the first edition of Rousseau’s Premier Discours, on the question ‘Si le rétablissement des sciences et des arts a contribué à épurer les mœurs’.

To help launch this important book, on Monday 3 August Burrows and Roe, joined by Melanie Conroy, one of the contributors, organized a ‘twitterstorm’, inviting dix-huitiémistes working on digital humanities projects of any sort to post links of their work on Twitter, tagged with #DigitizingEnlightenment.

Over the course of 48 hours stretching from first light Sunday morning in eastern Australia to midnight Monday night on the Pacific coast of the United States, 112 unique tweets were posted from 28 accounts. The sequence of posts may be read, in reverse chronological order, here.

To enlighten and enliven the discussion, and in the spirit of eighteenth-century intellectual exchange, the Voltaire Foundation sponsored a competition, asking for the most creative and thoughtful response to the question: ‘Has the rise of #dh been a boon or a barrier to #C18 studies?’

Twelve individuals posted responses, and the jury – consisting of Burrows, Roe and Conroy – deploying a sophisticated algorithm, ranked the entries and identified three runners-up and two winners.

The three runners-up were:

Helen Williams


As a first-gen scholar in the North East teaching & researching at a post-92 institution, #DigitizingEnlightenment is a boon, making the #18thcentury accessible & bringing diverse new voices, projects & approaches to scholarship & study. Many of us wouldn’t be here without it.

– Helen Williams (@helen189) August 3, 2020

Bryan Banks


Really excited to see this book come out!@SimonBu86342933 @glennhroe @MelanieConroy1 put the #DH in 𝐝ix-𝐡uitiemistes.

Today’s organized hashtag #DigitizingEnlightenment, like much DH work more broadly, makes the #18thC more legible and accessible to us today./1 https://t.co/IajlYLtPWk

– Bryan Banks (@BryanBanksPhD) August 3, 2020

Russell Goulbourne


Definitely a boon – because it’s the #DH analysis of huge numbers of texts that allows us to see that it’s precisely in the 1760s, at the height of the Enlightenment, that boon comes to mean “a benefit enjoyed”. QED. #DigitizingEnlightenment

– Russell Goulbourne (@FrenchProfessor) August 3, 2020

And now the winners:

Chad Wellmon

As Kant wrote 200+ years ago, DH has been a boon to #C18 studies. It’s a no-brainer @VoltaireOxford: “It is so easy to be immature. If I have a [computer] that has understanding for me, surely I do not need to trouble myself.” I. Kant, “An Answer to the Question ‘What is DH?’” https://t.co/wIGQJDT7p4

– chad wellmon (@cwellmon) August 3, 2020


Meghan K. Roberts

I hate to be the lone skeptic, but I am concerned about the influence of #DH and #DigitizingEnlightenment on the field. Some projects are wonderful for research and teaching, but I worry that others place too much emphasis on an extremely select group of French philosophes.

– Meghan Roberts (@MeghanKRoberts) August 3, 2020


Both winners received copies of Digitizing Enlightenment as well as OSE’s June 2019 title, another volume of essays which deployed digital humanities methods to study the eighteenth century, Networks of Enlightenment, edited by Chloe Edmondson and Dan Edelstein.

As a supplement to the printed books, the data visualizations, tables and figures, as well as a portion of the text for each of these two volumes, are accessible on open access on the OSE ‘Digital Collaboration Hub’, built on the Manifold Scholar platform and hosted by Liverpool University Press. These may be accessed, appropriately, at http://digitizingenlightenment.com

Thanks to all who participated – and we all hope to be able to renew the annual ‘Digitizing Enlightenment’ symposium in July 2021, to be hosted at the University of Montpellier, in the context of the ‘Enquête sur la globalisation des Lumières’ initiative.

– Gregory S. Brown

NB: For the month of August, copies of Digitizing Enlightenment are available for purchase at a 25% discount. Purchasers in North America may order from the OUP-Global site using the code “DISTRO25” and purchasers anywhere else in the world, including the UK, Europe and Australia, may order from the LUP site using the code “DIGITIZING25”.

Digitizing the Enlightenment

As country after country has gone into COVID-19 lockdown, we have all had to learn to communicate, network, teach, study and relate online in ways unimaginable a few short years – or even months – ago. This phenomenon is just the latest stage in the information-technology revolution and part and parcel of the ongoing development of an increasingly digital society. This revolution has touched almost every aspect of our lives, from how we work, study, shop and relax to how we make and maintain personal relationships. But it is also transforming scholarship and the way we conduct and communicate academic research. Thus, it is perhaps apt, and with consummate good timing, that Oxford University Studies in the Enlightenment has chosen to subject tag our new volume as ‘History of Scholarship (Principally of Social Sciences and Humanities)’. Yet this is certainly not how we and our collaborators envisaged our project at the outset, nor can any single tag capture the content of our volume and its collaborative agenda in its entirety.

The Digitizing Enlightenment workshop logo, designed by Evan Casey for the Voltaire Foundation, featured on the cover of Digitizing Enlightenment.

Ironically, as we write, Digitizing Enlightenment is also a living movement – or at least a loose network of scholars who meet annually in pursuit of a common agenda. That agenda was born in a series of conversations that took place from 2010, culminating in Dan Edelstein’s post-panel suggestion at the American Historical Association conference at Montreal in April 2014 that we should hold periodic meetings between like-minded digital projects relating to the Enlightenment. The aim of these meetings would be to establish common conventions and digital standards, with a view to linking our resources and realising the enormous and still largely untapped potential of Linked Open Data. Those present for Dan’s suggestion – Simon Burrows, Jeff Ravel, Sean Takats and Dan himself – have all provided chapters for our book, but much of the energy behind Digitizing Enlightenment since has come from Glenn Roe, whom Simon had first encountered a month earlier in Australia, where they had both recently taken up academic positions.

It was this fortuitous coincidence, underpinned by the fertile combination of Simon’s professorial establishment funds and Glenn’s energy, together with their mutual contact books, that led to Western Sydney University hosting the first Digitizing Enlightenment symposium in July 2016. Among the projects discussed there, and in our book, were large-scale treatments of Enlightenment correspondences, theatre attendance records, and textual corpora including the mid-eighteenth-century Encyclopédie; bibliometric projects were presented on the production and dissemination of literature; together with presentations on mapping and data visualization growing out of these projects. The symposium was so well received that it has been an annual event ever since. It was held at Radboud University in Nijmegen (2017), Oxford (2018), and Edinburgh (2019). In 2020, but for COVID-19, it would have been held in Montpellier.

It was not entirely by chance that such a project coalesced around the guiding notion of the ‘Enlightenment’. For the long eighteenth century has been blessed by a number of high-profile and long-established digital projects. These include ground-breaking commercial datasets such as Gale-Cengage’s Eighteenth-Century Collections Online (ECCO), which features in several of our chapters, semi-commercial projects such as the Electronic Enlightenment and large academic consortiums such as the Franco-American ARTFL project. This made the Enlightenment a natural laboratory for exploring the possibilities and achievements of the Digital Humanities for transforming scholarship on a single historical era. Further, as our book emphasises, our discussions built on a long tradition of digital innovation in eighteenth-century studies that can be traced back at least as far as the twin Livre et société dans la France du XVIIIe siècle volumes produced by a team led by François Furet in 1965 and 1970. It might further be added that our over-arching subject material lends itself to digital-historical analysis; the Enlightenment might after all be viewed as the long-run culmination of the intellectual turmoil and – as several contributors point out – information overload unleashed by a previous technological and communications revolution.

Digitizing Enlightenment is the July volume in the Oxford University Studies in the Enlightenment series.

With this in mind, then, we offer up Digitizing Enlightenment: Digital Humanities and the Transformation of Eighteenth-Century Studies as rather more than a contribution to the history of scholarship. Certainly, we have offered a sample of Digital Humanities c. 2016-2020, as it relates to the technologies available and their application to Enlightenment studies broadly construed. In addition, the first half of the book offers detailed accounts of the origins and development of key Enlightenment digital projects up until that point, accompanied by valuable and sometimes disarming insights on the dangers and delights of digital research from foremost practitioners in the field. These chapters, as well as some later contributions, are helping to reshape some dominant meta-narratives of the Enlightenment, not least by hinting simultaneously at the enduring aristocratic leadership of the French Enlightenment and the extent to which Enlightenment literary production and consumption was infused with religious content. However, our contributors also showcase other ways that Digital Humanities scholarship is in the process of changing the field through the transparency, methodological rigour, and collaborative imperatives that are necessary concomitants of this new kind of research. Finally, the book offers a collaborative roadmap for future digital research – at a moment when, as our final contributor, Sean Takats, points out, the Enlightenment is fast losing its privileged position as the most richly digitized century of the modern era. As a corollary, we hope that our volume may be as useful to scholars of other periods as to Enlightenment scholars themselves.

– Simon Burrows (Western Sydney University) and Glenn Roe (Sorbonne University)

Simon Burrows and Glenn Roe are the editors of the July volume in the Oxford University Studies in the Enlightenment series, Digitizing Enlightenment: Digital Humanities and the Transformation of Eighteenth-Century Studies, which is the first book-length survey of the impact of digital humanities on our understanding of a key historical period and paradigm.

This post is reblogged from Liverpool University Press.