The Newberry French Revolution Collection at ARTFL

As we begin planning Digitizing Enlightenment IV, which will take place in the context of the ISECS Congress in Edinburgh in July 2019, we are keen to broaden the scope and breadth of the Digitizing Enlightenment community in order to highlight new, and existing, digital projects across the interdisciplinary spectrum of eighteenth-century studies. This post, based on work presented at the Digitizing Enlightenment III workshop held in Oxford in July 2018, demonstrates how to identify text reuse – citations, borrowings, plagiarisms – as well as other techniques for leveraging freely available large data-sets from the 18C.
– Glenn Roe, Voltaire Lab

The incredible richness of the Newberry Library’s French Revolution Collection (FRC) has long been known. It consists of more than 30,000 pamphlets and more than 23,000 issues of 180 periodicals published between 1780 and 1810, representing the opinions of all the factions that opposed and defended the monarchy during the turbulent period between 1789 and 1799, and it also contains innumerable ephemeral publications of the early First Republic. The Newberry has released digital copies of more than 35,000 pamphlets totalling approximately 850,000 pages. Not only has the Newberry made the collection available to the public, but it has also released a data feed of the entire collection, consisting of the Library’s exceptional metadata describing each object, the OCR text data, and links to the digital facsimiles accessible from the Internet Archive, encouraging researchers and instructors to incorporate the digital collection in new kinds of scholarship and engagement.

In order to facilitate experimental work at ARTFL on this unparalleled resource, we have loaded two versions of this collection – based on a download of the collection from the Newberry’s GitHub repository in November 2017 – into PhiloLogic4, the latest release of ARTFL’s text analysis software. The full version contains all 38,377 documents dating from the 16th century to the end of the 19th century. Our second build attempts to eliminate duplicate documents, is restricted to the period 1787-1799, and thus contains 26,445 documents. Additional implementation information and full open access to both versions of the FRC collection are available online. The quality and coverage of the FRC texts make the collection an ideal environment in which to test a variety of experiments and algorithms that enhance access and open up new kinds of approaches, using the 1787-99 sample data. At the bottom of the ARTFL FRC page, we have provided links to several different models for examining the collection which are based on extensions to the PhiloLogic4 package.

The simplest model is a document-level search which returns matching documents ranked by relevancy, using the Python library Whoosh. This functions somewhat like a Google search on the collection, with links to the page images of the document or specific instances of the search words in context. For example, the results of a search for “conspirateurs aristocrates ennemis étrangères royalistes” can be seen here.
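To make the idea of relevancy ranking concrete, here is a minimal sketch in plain Python. It does not use Whoosh and does not reproduce its scoring: a simple TF-IDF sum stands in for the real ranking function. The three toy documents, their identifiers, and the function name `rank_documents` are all invented for illustration.

```python
import math
from collections import Counter

# Toy corpus: identifiers and texts are invented for illustration only.
DOCS = {
    "doc1": "les conspirateurs et les aristocrates menacent la nation",
    "doc2": "les ennemis de la liberté sont nombreux",
    "doc3": "décret sur les assignats et la monnaie",
}


def tokenize(text):
    """Lowercase and split on whitespace (a real tokenizer would do more)."""
    return text.lower().split()


def rank_documents(query, docs):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
    counts_by_doc = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, counts in counts_by_doc.items():
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents contain the term.
            df = sum(1 for c in counts_by_doc.values() if term in c)
            if df == 0 or term not in counts:
                continue
            # Rare terms weigh more than common ones.
            score += counts[term] * math.log(1 + n_docs / df)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


results = rank_documents("conspirateurs aristocrates ennemis royalistes", DOCS)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Documents that match more of the query terms rise to the top, and documents that match none are left out, which is the behaviour a relevancy-ranked search form presents to the user.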

The second approach is the application of a Topic Model algorithm to the collection. Topic Models, which are widely used in digital humanities, are a set of unsupervised learning algorithms that divide a collection into a specified number of clusters based on the vocabulary of each document. The results of the Topic Model have been added to the metadata of the PhiloLogic4 build of the 1787-99 sample data. Each document is identified as having a first and second topic, denoted as A or B, with a number from 00-49 as listed in this TABLE. The first column is the topic number; the second is one or more English keywords, which can also be searched. The third column is the top 3 weighted words (features) of that topic, and the fourth column is the rest of the top 10, all of which are shown in relative weight order. Thus, A29 will return the documents that have money assignats as the top weighted topic. Searching for “money” in topic models will return documents that have this as either the first or second topic. An alternative use of this data is to copy some or all of the terms in columns 3 and 4 into the Whoosh search form and get the documents in ranked relevancy order.
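The two ways of querying the topic metadata can be sketched as follows. This is only an illustration of the lookup logic, not PhiloLogic4 code: the record layout, the field name `topics`, the document titles, and both function names are hypothetical. Only the pairing of topic 29 with the keyword “money” comes from the description above.

```python
# Hypothetical records: field names and titles are invented for illustration.
DOCUMENTS = [
    {"title": "Pamphlet 1", "topics": ("A29", "B07")},
    {"title": "Pamphlet 2", "topics": ("A12", "B29")},
    {"title": "Pamphlet 3", "topics": ("A03", "B12")},
]

# Keyword-to-topic-number table; only the "money" entry is taken from the text.
TOPIC_KEYWORDS = {"money": "29"}


def by_topic_code(code, documents):
    """Match an exact code such as 'A29' (first topic) or 'B29' (second topic)."""
    return [d["title"] for d in documents if code in d["topics"]]


def by_keyword(keyword, documents):
    """Match a keyword against either the first or the second topic."""
    number = TOPIC_KEYWORDS[keyword]
    return [
        d["title"]
        for d in documents
        if any(code[1:] == number for code in d["topics"])
    ]


print(by_topic_code("A29", DOCUMENTS))
print(by_keyword("money", DOCUMENTS))
```

The code search is the narrower of the two, since it distinguishes first from second topic, while the keyword search accepts a document whenever the topic appears in either position.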

Our first presentation of this work, at Digitizing Enlightenment III, showed results from applying the latest version of our sequence aligner to detect text reuse – citations, borrowings, plagiarisms, and so on – from pre-Revolutionary documents during the Revolutionary period. Sequence alignment is a family of algorithms used in a surprising range of disciplines, from genetics to text analysis, to identify similar segments of arbitrary length. For this work, we aligned the FRC 1787-99 sample against ARTFL’s Frantext pre-1788 collection. The Frantext sample contains 1,263 documents and is particularly strong in 18th-century holdings. We loaded the results of the alignment run into a dedicated database which can be queried in a variety of ways, such as by source and/or target metadata as well as by words in matching passages.
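For readers unfamiliar with the idea, the following is a minimal sketch of finding shared passages between two texts. It is not the ARTFL sequence aligner, which tolerates variation between passages. It uses the exact matching-block search from Python's standard `difflib` module as a stand-in. The function name `shared_passages`, the `min_words` threshold, and the invented target sentence are assumptions for illustration.

```python
from difflib import SequenceMatcher


def shared_passages(source, target, min_words=5):
    """Return word sequences of at least min_words shared by both texts."""
    src = source.lower().split()
    tgt = target.lower().split()
    # autojunk=False stops difflib from discarding very frequent words.
    matcher = SequenceMatcher(None, src, tgt, autojunk=False)
    passages = []
    for block in matcher.get_matching_blocks():
        # The final block is a zero-length sentinel and is skipped here.
        if block.size >= min_words:
            passages.append(" ".join(src[block.a:block.a + block.size]))
    return passages


# The source is the opening of Rousseau's Contrat social; the target is invented.
source = "l'homme est né libre et partout il est dans les fers"
target = (
    "on nous répète que l'homme est né libre et partout il est "
    "dans les fers mais qui brisera ces fers"
)

print(shared_passages(source, target))
```

Working on word lists rather than characters keeps the match boundaries on whole words, which is what one wants when reporting borrowed passages.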

The public database (June 22, 2018 build) contains 8,937 aligned passages, of which around 1,000 were identified algorithmically as banalities. Filtering out shorter alignments (fewer than 10 words) results in just under 7,000 passages. It is important to note that these numbers are only indicative, since they can vary significantly depending on the approach we use to identify and, where appropriate, merge longer passages. The general frequencies are not particularly surprising. The following is a table of the number of borrowed passages in the FRC by author.

Montesquieu – 1,315

Rousseau – 1,133

Voltaire – 979

Mably – 303

Aulony – 263

Racine – 168

Helvétius – 167

D’Holbach* – 146


Saint-Simon – 135

Bossuet – 110

La Fontaine – 94

Diderot – 85

Corneille – 72

Mirabeau – 71

Boileau – 69

Bernardin – 67

Montaigne – 65

*D’Holbach appears as two entries due to slight metadata differences.
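The length filter and the per-author tally behind a table like this one can be sketched in a few lines. The records below are invented, as are the field names `source_author` and `passage`; real alignment records carry far more metadata.

```python
from collections import Counter

# Invented alignment records for illustration only.
ALIGNMENTS = [
    {
        "source_author": "Rousseau",
        "passage": "l'homme est né libre et partout il est dans les fers",
    },
    {"source_author": "Rousseau", "passage": "la volonté générale"},
    {
        "source_author": "Montesquieu",
        "passage": "il faut que par la disposition des choses le pouvoir arrête le pouvoir",
    },
]


def count_by_author(alignments, min_words=10):
    """Drop passages shorter than min_words, then count passages per author."""
    kept = [a for a in alignments if len(a["passage"].split()) >= min_words]
    return Counter(a["source_author"] for a in kept)


for author, count in count_by_author(ALIGNMENTS).most_common():
    print(author, count)
```

Changing `min_words`, or merging adjacent passages before counting, changes the totals, which is why the figures above should be read as indicative.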

The yearly distribution of borrowings from the top three Enlightenment authors again follows a reasonable pattern.

The annual distribution in the FRC of the 536 passages derived from Rousseau’s Contrat Social seems reasonable and would match expectations based on other things we know.

While the global numbers are interesting, if not very surprising, there are a number of specific texts and authors that would warrant further investigation. There are numerous chapbooks, such as the Calendrier moral, 1794, which are interesting because of their selection of inspiring passages from various authors. Jean-Jacques Barthélemy’s L’Accord de la religion et de la liberté (1791) features some 25 long extracts from d’Holbach’s Système social.

The alignment database is available to the public. The database has a variety of useful features. This link will push a search for all of the aligned passages in the FRC from Rousseau’s Contrat Social greater than 10 words. The report is laid out chronologically (in this case by FRC year). Each instance shows the matching passages with available metadata, links to the context of each passage, and a button to highlight the differences in each matching pair. The facets on the right will allow you to get frequencies by author, title, year and so on. Clicking on those will return the corresponding text pairs.

We anticipate further experimental work on the FRC, most notably in using the excellent subject information as ways to assess the accuracy of Topic Modelling and to consider supervised learning algorithms to further classify the collection by subject.

It is our pleasure to acknowledge that the Newberry Library has released this extraordinary resource under the Open Data Commons Attribution License, ODC-BY 1.0. We believe that this splendid collection and the Newberry’s release of all of the data will facilitate a generation of ground-breaking work in Revolutionary studies. If you find the collection useful, please do contact the Newberry Library to congratulate them on this wonderful initiative and to let them know how their efforts contribute to your research.

We would love to hear from you. Please send comments, suggestions and problem reports to

– Clovis Gladstone and Mark Olsen



Poetry in the digital age: the Digital Miscellanies Index and eighteenth-century culture

For most of us, reading for pleasure usually means getting stuck into some fiction or non-fiction. Poetry is a less common diversion, but we still have an appetite for poems to dip into, to find solace in, to memorise and share. And we can choose from an array of collections that promote poetry as an everyday companion, a form of therapy, and a tradition of national interest. For readers looking for peace of mind, The Emergency Poet: An Anti-Stress Poetry Anthology offers comfort, while the popular twin collections of Poems That Make Grown Men (or Women) Cry present a cult of sensibility for the modern age.

It was in the eighteenth century that poetry collections like these became a staple of literary publishing in Britain. The tradition of printed collections of English poetry stretches back to the sixteenth century, with Songes and Sonettes (1557), an edition of short lyric poems compiled by the publisher Richard Tottel, generally regarded as the foundation of English Renaissance poetry and the most important early printed collection of English verse. But it was not until the eighteenth century that collections of poems by several hands, with prose as a secondary feature, became one of the most common forms in which British readers encountered poetry. Like their modern counterparts, eighteenth-century editors and publishers sought to gain a foothold in a crowded market by targeting specific audiences and promoting the benefits of reading poetry. Some produced didactic collections for young people (Poems for Young Ladies); others pitched their collections to lovers in need of poetic inspiration (The Lover’s Manual); and many more set their sights on a local audience (The Oxford Sausage).


Poems for Young Ladies (1767), edited by the poet Oliver Goldsmith.

Collections like these shaped the ways in which poetry was written and read throughout the eighteenth century. Yet until recently relatively little was known about their contents. Thanks to the Digital Miscellanies Index (DMI), this is no longer the case. The DMI provides a searchable record of the contents of over 1,600 collections of poems by several hands published over the course of the eighteenth century. These books are sometimes referred to as anthologies, as most poetry collections are today. But the word anthology, derived from the Greek for ‘a gathering of flowers’, has connotations that sit uneasily with many eighteenth-century poetry collections. Few collections produced in this period claimed to present the best of English poetry, a rationale often seen as characteristic of anthologies (collections that cull the flowers of the poetic tradition). As a result, several scholars, myself included, prefer the term miscellany. Derived from the Latin miscellanea, meaning a ‘hotchpotch’ of foodstuffs, it captures the dominant characteristic of most eighteenth-century collections: variety. A typical miscellany offers a varied feast of poems to entertain readers with varied tastes and personalities.

The DMI was launched in 2013, following three years of development and data collection carried out by a team based at the University of Oxford. Led by Abigail Williams and Jennifer Batt, the project was funded by the Leverhulme Trust. In 2014, another Leverhulme grant set in motion the second phase of the project. One of the aims of this phase, to be completed in 2017, is to harness the data now accessible via the DMI to shed new light on how miscellanies evolved, how they packaged and popularised poetry, and on the habits of their readers. At the same time, we are working with the Bodleian’s Digital Libraries team to develop the DMI into a more flexible and wide-ranging resource, and last month we celebrated a milestone on this road. The thirty-strong audience at Lines of Connection, a conference I co-organised as part of the project, were among the first to see the DMI’s new search interface, which replaces the beta site created in 2013.


The Book of Fun (1759), a miscellany dominated by seventeenth-century verse.

The new search platform is much more than a digital facelift for the DMI. It provides access to a database undergoing expansion: the latest version includes new records for miscellanies published between 1680 and 1699, and future updates will extend the DMI’s coverage further back to Tottel’s foundational Songes and Sonettes. The redeveloped interface also enables users to explore the data in new ways. Keyword and phrase searching is quicker and more extensive with the new basic search function. There is also the option to filter the records using a number of facets, which display and rank the data in ways that suggest key trends and lines of enquiry. For instance, clicking on ‘Poem’ under ‘Content Type’, then selecting the ‘Related People’ facet, reveals a list of almost one hundred of the most prominent authors in the database, ranked according to the number of poems attributed to them. At the top of the list is John Dryden, with around 1,500 poems; the highest ranked French author is Nicolas Boileau-Despréaux, with over 120 poems in English translation (the DMI does not record appearances of poems in foreign languages). Although these figures should not be seen as straightforward indications of popularity, they remind us that many of the most widely read poets of the eighteenth century were those who had been active in the late seventeenth century. In his imitation of Horace’s epistle to Augustus (written 1737), Alexander Pope observed that the verse of his seventeenth-century predecessors was scattered ‘Like twinkling stars the Miscellanies o’er’. The DMI has made it possible to see these stars, and the sky around them, more clearly.

– Carly Watson

Digitizing Raynal (and Diderot): New Digital Editions of the Histoire des deux Indes

A collaborative digital research project

On the heels of Cecil Courtney and Jenny Mander’s recent publication, Raynal’s ‘Histoire des deux Indes’: colonialism, networks and global exchange (OSE, 2015), I am pleased to announce a new international research project aimed at further exploring Raynal’s monumental work and its impact on Enlightenment thought. Thanks to the generous support of the Consortium for the Study of the Premodern World at the University of Minnesota, the Centre for Digital Humanities Research at the Australian National University, Stanford University Libraries, and The ARTFL Project at the University of Chicago, we have recently completed the digitization and text encoding (in TEI-XML) of the three primary editions of the Histoire philosophique et politique des établissements et du commerce des Européens dans les deux Indes. These editions – the first edition of 1770, the second of 1774, and the 1780 third edition – were those that Raynal himself oversaw during his lifetime.

Our digital editions are based on high quality PDFs provided by the BNF’s Gallica online library (1770 and 1780 editions) and the Bodleian’s Oxford Google Books Project (1774 edition). A preliminary search interface has been built using the ARTFL Project’s PhiloLogic software and can be accessed here: Raynal search form. Users can query one or all of the above editions, which represent the first publicly available full-text digital edition(s) of the Histoire des deux Indes. In the coming months we will release a new version of the database running on ARTFL’s state-of-the-art PhiloLogic4 system, along with a preliminary ‘intertextual interface’ that will aim to incorporate the text of the three separate editions into one reading interface.


Title page and frontispiece of the 1780 edition of Raynal’s Histoire des deux Indes (Gallica).

Diderot, Hornoy, and the 1780 edition

What is perhaps most exciting about these new digital resources is the inclusion of a unique 1780 edition of the Histoire des deux Indes recently made available by the BNF. Acquired at public auction in March 2015, this particular edition had been conserved since the late 18th century in the private library of Alexandre Marie Dompierre d’Hornoy (1742-1828). A lawyer at the Parlement de Paris and great-nephew of Voltaire – he in fact inherited Jean-Baptiste Pigalle’s infamous nude statue of Voltaire upon his great-uncle’s death – Hornoy corresponded with many of the philosophes, Diderot included. His copy of the Histoire contains pencil marks in the margins of some passages, an unremarkable fact, perhaps, were it not for a note written by Hornoy just above a three-page insert at the beginning of the first tome. The handwritten tables included in the insert list all the sections marked in pencil over the four volumes of text: ‘mourceaux qui sont de M. Diderot’, Hornoy writes, ‘marqués en crayon par Mme de Vandeul’. Madame de Vandeul was, of course, Diderot’s daughter.


Handwritten insert of the 1780 edition (Gallica)

The existence of such an annotated volume of the Histoire was posited in the 19th century, notably by Joseph Marie Quérard in his Supercheries littéraires dévoilées (5 vols., 1845-1856). Quérard claimed that there existed a copy of the 1780 edition on which Diderot himself had marked in pencil all the passages that belonged to him [1]. According to Quérard, this copy became the property of Madame de Vandeul shortly after Diderot’s death. Whether or not the copy acquired by the BNF is the same as that owned by Vandeul we cannot say for sure, but Herbert Dieckmann, in his inventory of the ‘fonds Vandeul’, also mentions the hypothetical existence of a copy of the in-4° edition (i.e. 1780) that was purportedly annotated by hand, but that had since been lost [2].

Some preliminary experiments

While consensus as to the validity of Hornoy’s assertion that the marked sections are in fact those authored by Diderot will most likely take years to accrue, we can begin, using the new digital edition, to ask some basic questions as to the authorship claims indicated in the text. Thanks to extensive markup in TEI-XML notation, sections purportedly belonging to Diderot are clearly indicated, and perhaps more importantly, can be extracted as one test corpus. Using some basic statistical measures drawn from authorship attribution studies, or Stylometry, we can begin to think about how the ‘Diderot’ sections may, or may not, differ stylistically – i.e. in terms of comparative word usage over the most common words, an established metric of ‘authorship’ in stylometry and forensic linguistics – from the rest of the text.
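As a rough illustration of what “comparative word usage over the most common words” means in practice, the sketch below builds a shared list of most frequent words and then expresses each text as a vector of relative frequencies over that list. It is a generic stylometric preprocessing step written for this post, not the procedure of any particular software package. The function names and the toy sentences are invented.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a real tokenizer would do more)."""
    return text.lower().split()


def top_words(texts, n):
    """Return the n most frequent words across all the texts combined."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]


def profile(text, vocabulary):
    """Relative frequency of each vocabulary word in a single text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


# Toy texts, invented for illustration.
text_a = "le roi et le peuple"
text_b = "la nation et la loi et la liberté"

vocabulary = top_words([text_a, text_b], 3)
print(vocabulary)
print(profile(text_a, vocabulary))
print(profile(text_b, vocabulary))
```

Because the vectors hold proportions rather than raw counts, texts of very different lengths can be compared directly, which matters when one corpus is much smaller than the other.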


Page from 1780 edition with ‘Diderot’ section marked in pencil (Gallica)

Working with the Centre for Literary and Linguistic Computing at the University of Newcastle (Australia), and in particular with their Intelligent Archive software for stylistic and statistical text analysis, we extracted the top 200 words for each ‘author’ (i.e. those drawn from sections putatively by Diderot, and the remaining ‘Raynal’ sections). As a result, we were left with 4 ‘Diderot’ tomes (containing all of the text marked in pencil) and 4 ‘Raynal’ tomes (containing the remainder), representing their unique word lists over the entire edition. For a first preliminary test, we ran a cluster analysis on the 8 tomes to see if they would cluster together or separately:


Cluster analysis of ‘Diderot’ tomes vs. ‘Raynal’ tomes, based on top 200 word lists

Cluster analysis works by grouping (or clustering) the most similar texts first and the most distinct last, in this case into 2 branches. A division like the one above, clearly separated into two distinct ‘trees’, is a clear indication that the texts in each of the two branches are highly likely to be those of two different authors.
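The merging process described above can be sketched as a small agglomerative clustering routine. This is a generic illustration, not the Intelligent Archive procedure: the choice of Euclidean distance and average linkage is an assumption, and the four two-dimensional word-frequency profiles (labelled D1, D2, R1, R2) are invented.

```python
import math


def distance(u, v):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def average_linkage(c1, c2, vectors):
    """Mean distance over all pairs drawn from the two clusters."""
    total = sum(distance(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster(vectors, n_clusters=2):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[name] for name in vectors]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], vectors)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Invented profiles: relative frequencies of two common words per text.
PROFILES = {
    "D1": [0.10, 0.02],
    "D2": [0.11, 0.03],
    "R1": [0.04, 0.09],
    "R2": [0.05, 0.08],
}

branches = cluster(PROFILES)
print(branches)
```

Recording the order and distance of each merge, rather than only the final two groups, is what produces the tree diagram: the last merge to happen corresponds to the top-level split.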

Principal component analysis (PCA) provides another method of examining our corpora. PCA is a procedure for identifying a smaller number of uncorrelated variables, called ‘principal components’, from a large set of data. The goal of PCA is to explain the maximum amount of variance with the fewest principal components. In our case, it allows the first two principal components of the word-frequency variance in our two sets of texts to be plotted on a two-dimensional graph. One of these plots (using the 100 most frequent words of the full text), with both text corpora divided into 10,000-word blocks, is shown below.


Principal component analysis using 10,000 word blocks and 100 most frequent words

The disparity in size of our two test corpora meant that while there were 68 text sections for Raynal (in green), there were only 14 for Diderot (in blue). Nonetheless, the separation between the two authorial sets is almost complete, with just two of the Diderot sections located in the outer fringes of the Raynal set. Since the word variables underlying this plot were the 100 most frequent words of the whole text, this is a convincing stylistic division, one that suggests a strong distinction in terms of authorship signal between the two sets.
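For readers who want to see how a plot of this kind is computed, here is a minimal PCA sketch in plain Python. It is not the software used for the analysis above. It centres the data, builds the covariance matrix, and finds the first two components by power iteration with deflation, a simple method chosen here for transparency. The four three-word profiles are invented, and the fixed starting vector is an assumption that works for this toy data but could fail if it happened to be orthogonal to a component.

```python
import math


def center(rows):
    """Subtract each column's mean so every variable has mean zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def multiply(matrix, v):
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]


def leading_eigenvector(matrix, iterations=200):
    """Power iteration: repeated multiplication converges on the top component."""
    v = [float(i + 1) for i in range(len(matrix))]  # arbitrary start vector
    for _ in range(iterations):
        w = multiply(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def deflate(matrix, v):
    """Remove the variance explained by unit vector v from the matrix."""
    eigenvalue = sum(a * b for a, b in zip(v, multiply(matrix, v)))
    d = len(matrix)
    return [
        [matrix[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
        for i in range(d)
    ]


def pca_2d(rows):
    """Project each row onto the first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    pc1 = leading_eigenvector(cov)
    pc2 = leading_eigenvector(deflate(cov, pc1))
    return [
        (sum(x * a for x, a in zip(r, pc1)), sum(x * a for x, a in zip(r, pc2)))
        for r in centered
    ]


# Invented profiles: two 'Diderot' blocks followed by two 'Raynal' blocks.
BLOCK_PROFILES = [
    [0.10, 0.02, 0.05],
    [0.11, 0.03, 0.05],
    [0.04, 0.09, 0.06],
    [0.05, 0.08, 0.06],
]

scores = pca_2d(BLOCK_PROFILES)
for x, y in scores:
    print(round(x, 4), round(y, 4))
```

Each pair of numbers is one point on the two-dimensional graph. In this toy data the two groups fall on opposite sides of the first axis, which is the kind of separation the plot above shows at much larger scale.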

In order to account for the size discrepancy between the two corpora, we ran another PCA test but this time we increased the number of Diderot sections by segmenting his text into 5,000 word blocks and running these against the previous Raynal 10,000-word sections. This plot is shown below:


Principal component analysis on 5,000 word blocks (Diderot) and Raynal, using 100 most frequent words

Here we see the same sort of authorial/stylistic separation as we saw above, but this time (with the Diderot sections halved in size) the distinction is even stronger, as there is only one section located within the Raynal set of entries, indicating an even greater likelihood that the sections marked in pencil were written by a different author than the rest of the 1780 edition.
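The segmentation step used in both plots, cutting a text into fixed-size word blocks, can be sketched as follows. How a final short block is treated is not stated above, so the `keep_remainder` option is an assumption, as is the function name. The sample string is invented.

```python
def word_blocks(text, block_size, keep_remainder=True):
    """Split a text into consecutive blocks of block_size words."""
    words = text.split()
    blocks = [
        words[i:i + block_size] for i in range(0, len(words), block_size)
    ]
    # Optionally drop a final block that is shorter than the others.
    if not keep_remainder and blocks and len(blocks[-1]) < block_size:
        blocks.pop()
    return blocks


sample = "un deux trois quatre cinq six sept"
print(word_blocks(sample, 3))
print(word_blocks(sample, 3, keep_remainder=False))
```

Halving the block size roughly doubles the number of data points for the smaller corpus, at the cost of noisier frequency estimates within each block.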

These are obviously very rudimentary experiments, but they nonetheless indicate several promising future avenues of exploration. Moving forward, we intend to apply a full suite of computational and stylistic approaches to the 1780 edition and its predecessors, including sequence alignment tools developed by ARTFL, text collation software, and the MEDITE system developed by the labex OBVIL at the Sorbonne for computational genetic criticism. All of these approaches will allow us to explore the textual evolution of the Histoire from 1770 to 1780 in an unprecedented manner, as well as its relationship to other Enlightenment texts and text collections such as Electronic Enlightenment, TOUT Voltaire, and the Encyclopédie.

– Glenn Roe

*I would especially like to thank Alexis Antonia and the Centre for Literary and Linguistic Computing at Newcastle for their generous help with the above stylistic analyses.

[1] See Michèle Duchet, Diderot et l’Histoire des deux Indes ou l’écriture fragmentaire, Paris, Nizet, 1978, p. 22.

[2] Herbert Dieckmann, Inventaire du fonds Vandeul et inédits de Diderot, Genève, Droz, 1951.