The Corpus of Late Modern English Texts, version 3.0
The Corpus of Late Modern English Texts, version 3.0 (CLMET3.0) has been created by Hendrik De Smet, Hans-Jürgen Diller and Jukka Tyrkkö, as an offshoot of a bigger project developing a database of text descriptors (Diller, De Smet & Tyrkkö 2011). CLMET3.0 is a principled collection of public domain texts drawn from various online archiving projects. In total, the corpus contains some 34 million words of running text. It incorporates CLMET and CLMETEV, and has been compiled following roughly the same principles, that is:
However, compared to CLMET and CLMETEV, it comes with a number of important improvements (in addition to being substantially bigger):
| Sub-period | Number of authors | Number of texts | Number of words |
| --- | --- | --- | --- |
| 1710-1780 | 51 | 88 | 10,480,431 |
| 1780-1850 | 70 | 99 | 11,285,587 |
| 1850-1920 | 91 | 146 | 12,620,207 |
| TOTAL | 212 | 333 | 34,386,225 |
The corpus covers five major genres: narrative fiction, narrative non-fiction, drama, letters and treatise, in addition to a number of unclassified texts. The breakdown of word counts by genre and sub-period is as follows:
| Genre | 1710-1780 | 1780-1850 | 1850-1920 |
| --- | --- | --- | --- |
| Narrative fiction | 4,642,670 | 4,830,718 | 6,311,301 |
| Narrative non-fiction | 1,863,855 | 1,940,245 | 958,410 |
| Drama | 407,885 | 347,493 | 607,401 |
| Letters | 1,016,745 | 714,343 | 479,724 |
| Treatise | 1,114,521 | 1,692,992 | 1,782,124 |
| Other | 1,434,755 | 1,759,796 | 2,481,247 |
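
The figures in the genre table can be cross-checked against the sub-period totals with a few lines of Python. The sketch below is only an illustration: the word counts are copied from the table, while the names `GENRE_WORDS`, `period_totals` and `genre_shares` are invented for this example and are not part of the corpus distribution.

```python
# Word counts per genre and sub-period, copied from the genre table.
# All names in this sketch are illustrative; none ship with the corpus.
SUB_PERIODS = ("1710-1780", "1780-1850", "1850-1920")

GENRE_WORDS = {
    "Narrative fiction": (4_642_670, 4_830_718, 6_311_301),
    "Narrative non-fiction": (1_863_855, 1_940_245, 958_410),
    "Drama": (407_885, 347_493, 607_401),
    "Letters": (1_016_745, 714_343, 479_724),
    "Treatise": (1_114_521, 1_692_992, 1_782_124),
    "Other": (1_434_755, 1_759_796, 2_481_247),
}


def period_totals(genre_words):
    """Sum the genre counts column by column: one total per sub-period."""
    return tuple(sum(column) for column in zip(*genre_words.values()))


def genre_shares(genre_words):
    """Return each genre's share of its sub-period, as a percentage."""
    totals = period_totals(genre_words)
    return {
        genre: tuple(100 * count / total for count, total in zip(counts, totals))
        for genre, counts in genre_words.items()
    }


if __name__ == "__main__":
    totals = period_totals(GENRE_WORDS)
    for period, total in zip(SUB_PERIODS, totals):
        print(f"{period}: {total:,} words")
    print(f"TOTAL: {sum(totals):,} words")

    # Percentage of each sub-period taken up by each genre.
    for genre, shares in genre_shares(GENRE_WORDS).items():
        print(f"{genre:<22}", " ".join(f"{share:5.1f}%" for share in shares))
```

Summing the genre columns in this way reproduces the word counts given per sub-period (10,480,431, 11,285,587 and 12,620,207) and the overall total of 34,386,225 words.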
To download the corpus, you can obtain a free password and user-id by contacting Hendrik De Smet. If you already have a password and user-id, simply click here to download or access.
----------
References:
Diller, H., De Smet, H. & Tyrkkö, J. (2011). A European database of descriptors of English electronic texts. The European English Messenger 19, 21-35.