The Corpus of Late Modern English Texts, version 3.0

The Corpus of Late Modern English Texts, version 3.0 (CLMET3.0) was created by Hendrik De Smet, Hans-Jürgen Diller and Jukka Tyrkkö as an offshoot of a larger project to develop a database of text descriptors (Diller, De Smet & Tyrkkö 2011). CLMET3.0 is a principled collection of public-domain texts drawn from various online archiving projects. In total, the corpus contains some 34 million words of running text. It incorporates CLMET and CLMETEV and has been compiled following roughly the same principles, that is:

However, compared to CLMET and CLMETEV, it comes with a number of important improvements (in addition to being substantially bigger):

The following table summarises the corpus make-up:

Sub-period  Number of authors  Number of texts  Number of words
1710-1780                  51               88       10,480,431
1780-1850                  70               99       11,285,587
1850-1920                  91              146       12,620,207
TOTAL                     212              333       34,386,225

The corpus covers five major genres: narrative fiction, narrative non-fiction, drama, letters and treatise, in addition to a number of unclassified texts. The distribution of words across genres in each sub-period is as follows:

Genre                  1710-1780  1780-1850  1850-1920
Narrative fiction      4,642,670  4,830,718  6,311,301
Narrative non-fiction  1,863,855  1,940,245    958,410
Drama                    407,885    347,493    607,401
Letters                1,016,745    714,343    479,724
Treatise               1,114,521  1,692,992  1,782,124
Other                  1,434,755  1,759,796  2,481,247
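
The two tables above can be cross-checked against each other: the genre word counts for each sub-period should add up to that sub-period's total, and the three sub-period totals should add up to the corpus total. The following Python sketch does this with the figures copied from the tables. It is an illustration only and not part of the corpus distribution; the names GENRE_WORDS, PERIOD_TOTALS, column_sums and genre_share are made up for this example.

```python
# Word counts per genre, one value per sub-period
# (1710-1780, 1780-1850, 1850-1920), copied from the genre table.
PERIODS = ("1710-1780", "1780-1850", "1850-1920")

GENRE_WORDS = {
    "Narrative fiction":     (4_642_670, 4_830_718, 6_311_301),
    "Narrative non-fiction": (1_863_855, 1_940_245,   958_410),
    "Drama":                 (  407_885,   347_493,   607_401),
    "Letters":               (1_016_745,   714_343,   479_724),
    "Treatise":              (1_114_521, 1_692_992, 1_782_124),
    "Other":                 (1_434_755, 1_759_796, 2_481_247),
}

# Sub-period totals and corpus total, copied from the make-up table.
PERIOD_TOTALS = (10_480_431, 11_285_587, 12_620_207)
CORPUS_TOTAL = 34_386_225


def column_sums(table):
    """Sum the word counts of all genres for each sub-period."""
    return tuple(sum(column) for column in zip(*table.values()))


def genre_share(table, genre, period_index):
    """Return a genre's fraction of all words in one sub-period."""
    period_total = sum(row[period_index] for row in table.values())
    return table[genre][period_index] / period_total


sums = column_sums(GENRE_WORDS)
for period, computed, stated in zip(PERIODS, sums, PERIOD_TOTALS):
    status = "ok" if computed == stated else "MISMATCH"
    print(f"{period}: {computed:,} words ({status})")

print(f"Corpus total matches: {sum(sums) == CORPUS_TOTAL}")

# Share of narrative fiction in each sub-period, as a percentage.
for index, period in enumerate(PERIODS):
    share = genre_share(GENRE_WORDS, "Narrative fiction", index)
    print(f"Narrative fiction, {period}: {share:.1%}")
```

The same genre_share helper can be used for any other genre, for instance to follow the declining proportion of letters across the three sub-periods.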

To download the corpus, you can obtain a free password and user-id by contacting Hendrik De Smet. If you already have a password and user-id, simply click here to download or access.

----------

References:

Diller, H., De Smet, H. & Tyrkkö, J. (2011). A European database of descriptors of English electronic texts. The European English Messenger 19, 21-35.