Samuel Johnson by Joshua Reynolds, 1775

Offshoots of the OED

What is the noun most frequently modified by the adjective naked? Nope, not body, woman or man, although those are second, third and fourth on the list. It’s eye. Bet you didn’t know that.

Nor could you have — not without access to the clever software that hunts such things down in a corpus. That’s the immense body of text used by dictionary makers to divine the meanings and use of words.
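For the curious, the kind of count behind the naked eye finding can be roughly sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name following_words and the sample sentences are invented here, and the sketch simply counts whichever word comes directly after the target, whereas real corpus software would presumably use grammatical tagging to pick out the nouns an adjective modifies.

```python
import re
from collections import Counter


def following_words(text, target):
    """Count the words that immediately follow `target` in `text`.

    Hypothetical helper for illustration: it ignores sentence boundaries
    and does not check that the following word is actually a noun.
    """
    # Lower-case the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Pair each token with its successor and keep pairs that start with target.
    return Counter(nxt for word, nxt in zip(tokens, tokens[1:]) if word == target)


# Invented sample text, standing in for a corpus of billions of words.
sample = (
    "Visible to the naked eye. The naked truth hurt. "
    "Too small for the naked eye, yet a naked man saw it."
)
counts = following_words(sample, "naked")
print(counts.most_common(1))
```

Run over a corpus rather than three sentences, the same tally would rank every word that keeps company with naked.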

The Oxford Corpus is the largest in the world, at two billion-plus words (total, not unique words). It became available to the makers of the Oxford English Dictionary in 2006. Words in the Corpus date from 2000 to 2006, and a few hundred thousand are added every year, so that it contains a broad slice of English as it is currently used. The texts are chosen mainly from online sources — books, newspapers, journals, manuals, blogs, chatrooms, even fanzines and “underground and counterculture websites”.

Digital corpora have revolutionised lexicography. Dictionary-making used to be agonisingly slow and painstaking — and it still is, but now software can do much of the work. From Samuel Johnson, the 18th-century author of the first English dictionary in its modern form, onwards, lexicographers have traced words as used by different writers at different times to show how each one attains its full range of meanings. It used to be done with “quotation slips” — five million of them for the first OED — and an army of volunteers and editors. Now, of course, no human army could possibly keep track of all the words we produce.

Dr Johnson and his forebears began their work at a time when books, newssheets and other printed materials had been truly widespread for barely a century. Linguistic conservatives saw in this democratic explosion of words a threat to pure and traditional language. Dictionaries were one way, so they hoped, to hold back the flood and ensure that “correct” usage was safeguarded. (Another was national language academies.) But preservation was a futile hope even in the 1700s, as Dr Johnson realised during his nine-year lexicographical labour: “When we see men grow old and die…, we laugh at the elixir that promises to prolong life to a thousand years; and with equal justice may the lexicographer be derided, who… shall imagine that his dictionary can embalm his language, and secure it from corruption and decay…”

Language lives and evolves, and the more the balance between spoken and written shifts towards the latter (although we still speak many times more words than we write), the more rapid change is likely to be. That’s without even taking account of globalisation, a centuries-old process whose traces are visible in the English language itself, with its throng of immigrant words, some of which are very ancient. Nor of specialisation, which has given us millions of technical terms.

All this is the historical background to two new books from Oxford University Press, publishers of the OED. One, from which the Samuel Johnson quote above is lifted, is Jeremy Butterfield’s excellent Damp Squid: The English Language Laid Bare. Butterfield is a lexicographer, and he shows just how the Corpus can be put to use to accurately capture English as it is used now.

As with any other set of data, you have to ask the right questions. This is not a book for scholars, so Butterfield takes up examples that reveal to the average but alert English user just how astonishingly complex and context-dependent — and therefore subtle — his language is. His method is not just statistical (how many times certain words occur, how many originate in Greek, Irish, Hindi…), it’s also qualitative. Bank, for example: the software lines up chunks of text from the Corpus with that word, showing not only its different meanings but their relative frequency. With others, such as quirky, a sampling helps outline the idiomatic use, whether it typically refers to female or male subjects, and so on.
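The lining-up of text chunks described above is what lexicographers call a concordance, or keyword-in-context listing. A minimal sketch in Python, assuming nothing about the software Butterfield actually uses (the function name concordance, the three-word window and the sample sentences are all invented for illustration), might look like this:

```python
import re


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    Hypothetical helper: `width` is the number of words kept on each side.
    """
    # Split into simple word tokens, keeping the original capitalisation.
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text with two senses of "bank".
sample = "She sat on the river bank all day. The bank raised its rates again."
hits = concordance(sample, "bank")
for left, kw, right in hits:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} [{kw}] {right}")
```

With the occurrences stacked in a column, a reader can sort them by sense (riverside or lender) and count how often each turns up, which is the relative frequency the paragraph above mentions.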

Butterfield points out how useful this can be to teachers of English (350 million Chinese are learning the language); OUP makes a pile of money internationally on English guides.

The other book is BBC “word expert” Susie Dent’s Words of the Year. It’s a bright and flippable guide to terms either freshly coined (McQualification, purrcast) or given new meaning (embuggerance, mosquito) in 2008. Some will survive, many will not. The book could be published so promptly only because Dent had access to the Corpus.

It’s worrying to think that this review too may soon be part of a corpus — alongside such ephemera as blogs and tweets. By putting ourselves “out there”, we make as well as consume language. And the more we invest in cyberspace, the more we validate the use of mass-aggregation tools like the corpus, like Google, in the study of culture.

