Sunday, April 7, 2013

Finding collocations and language in context using web tools

In this post I take a look at finding collocations and lexis in context, using free online corpus-based web tools. First we'll look at what lexis is and what a corpus is, and how they can help us consolidate our vocabulary.

What is lexis?

Last year I started using the New English File Advanced coursebook with students. After every reading text it has a section called Lexis in context. 'Don't worry', I told my students, 'Lexis is just a trendy new word for vocabulary.' (For some reason I'm not too keen on trendy new terms - I still prefer Phrasal verb to Multipart verb!). But it turns out I was oversimplifying things a bit.
Here's Leo Selivan writing on the British Council Teaching English website:
Vocabulary is typically seen as individual words (often presented in lists) whereas lexis is a somewhat wider concept and consists of collocations, chunks and formulaic expressions.
As I understand it, it seems that we store language in our memories in chunks (groups of words) rather than single words, and this is how we build up the sentences we make. I'm sure you already say things like for example or excuse me without thinking too much about the individual words. And hopefully you do the same with at least some dependent prepositions - phrases like it depends on, I apologise for should come fairly automatically. The 'lexical approach' to language learning just takes this a bit further.
Experiment - Look back at my first paragraph and see if you can see any 'lexical chunks', groups of words that seem to go naturally together. I can make out at least six.

Computers and the Internet

Finding and cataloguing these chunks from real life speech and writing has become much easier with computers and the Internet, and especially through the use of so-called corpora. This chunking is (I think, I'm no expert) the basis for programs like Google Translate, which looks for patterns rather than particular words or grammar constructions.

A case in point

I had a very good example of this the other day. I'm currently preparing a post on homophones (words that sound the same but which have different meanings and are spelt differently, like buy, by, bye). As I typed some example sentences into Google Docs, it red-lined some of my false homophones, for example:
  • He strummed a few cords on the guitar.
  • The hurricane reeked havoc in the area.
  • I always try and give him a really wide birth when I see him.
A simple spell checker wouldn't have found those, as they are all genuine words, and nor would a grammar checker, as they are all grammatically correct. And indeed it turns out (that chunk again) that 'Google Docs uses contextual spelling suggestions to try to figure out which words you meant to type based on the content in your document' (Google Drive Help). In other words they look for patterns or chunks. And in a way that's what the lexical approach does.

What are corpora?

Corpora is the plural of corpus, which is Latin for body. In linguistics, a corpus is basically a collection (body) of samples of real-world texts (written and spoken) stored on computer. The most famous are the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA), but there are many others.
The growth of the Internet has brought these corpora into the reach of everyone, not just the experts, and the Internet itself can be seen as one big corpus. The usefulness of corpora for teaching or studying a language is that we can find how these collocations and chunks are used in the real world, if we have the right tools. And in this post I want to look at some free online tools that anyone can use.

First stop: the dictionary

The first thing to do is to look up your target word in a good learner's dictionary. Fortunately, four of the main British ones are free to use online. At least three of them (Cambridge, Macmillan and Longman) are corpus-based, but my own favourite is the Oxford Advanced Learner's Dictionary, mainly because its search is easier to use: enter a phrasal verb, an idiom or a past form of a verb and it'll take you straight there. It also has useful Usage notes, and some illustrated pages.
Good learner's dictionaries will often tell you the main collocates and will usually give you one or two example sentences, more for commonly used words.
You can access all these dictionaries and more from my gadget on the right. But there are also some special tools which can help us find collocations, and tell us which are the most common, statistically. Many of these tools I found out about at a website for teachers - Leoxicon (link below).

Finding collocations

For Better English - GDEX Dictionary

This only seems to work on single words; it found no matches for wait for, for example. It lists common collocates in different grammatical functions with example sentences taken from its corpus. The grammatical labels can be a bit daunting, for example for the adjective harsh, it gives, amongst other things:
Grammatical label    Collocate    Sample sentence (extract)
pp_of                condition    in the harshest conditions
modifier             unduly       unduly harsh
np_adj_comp_of       seem         it seems a bit harsh
modifies             reality      The harsh reality ...
adj_comp_of          seem         which seemed quite harsh
You can see the page here. This looks a bit technical, but I would just look at the collocates and the example sentences without worrying too much about the grammatical jargon, which I imagine is really aimed at linguists. Here is what they have for unlikely:

Just the Word

Just the Word is very easy to use. Enter a word and it'll come up with a list of collocates divided into grammatical functions. If you try with unlikely, for example, you'll get a list starting with *unlikely* N (unlikely + noun, eg: unlikely place) and *unlikely* PREP (unlikely + preposition, eg: unlikely in). These grammatical labels are perhaps a bit easier to understand than the GDEX ones.
I'm interested in finding the most common intensifiers for unlikely, so what I need to do is scroll down to ADV *unlikely* (adverb + unlikely), where I see the longest green lines are for highly and most, which means they are the most common collocations: EG
Some features of Just the Word:
  • JTW works with phrasal verbs like wait for, but not with phrases such as bad temper.
  • Unlike GDEX, JTW gives you a ranking figure for each collocation, so you can see how different collocations compare.
  • JTW is good for people who are specifically interested in how words collocate in British English, as it seems to be based on the British National Corpus (BNC).
  • Clicking on any of the collocations brings up a page of example sentences from the BNC.
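To give a rough idea of what a collocation finder is doing behind the scenes, here is a minimal Python sketch that counts which words come immediately before unlikely in a few sentences. The sample sentences and the function name words_before are my own inventions for illustration. I don't know how Just the Word is actually implemented, and a real tool works on a large, grammatically tagged corpus rather than four sentences.

```python
import re
from collections import Counter

# Tiny made-up sample; a real tool would use a corpus such as the BNC.
SAMPLE = (
    "It is highly unlikely that he will come. "
    "She thought it most unlikely. "
    "A win now looks highly unlikely. "
    "That seems extremely unlikely to me."
)

def words_before(target, text):
    """Count the words that appear immediately before `target`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pairs = zip(tokens, tokens[1:])  # each word paired with the next one
    return Counter(prev for prev, word in pairs if word == target)

counts = words_before("unlikely", SAMPLE)
for word, n in counts.most_common():
    print(word, n)
```

In this sample highly comes out on top with two occurrences, which is the same kind of ranking the green lines show, only on a much smaller scale.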


Netspeak

With Netspeak you have to make certain choices first, as shown in this screenshot from their website, but the results are very easy to read, and you can use it with phrases as well as single words and phrasal verbs.
The one I've used most so far is ?. You can use it:
  • before a word or expression - ? unlikely eg
  • after a word or expression - unlikely ? eg
  • between two words or expressions - In the unlikely ? that eg
Similarly you can use ... to find more than one word in those positions, as well as the other options shown. Just go to their page to try them out.
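For anyone curious about how this kind of wildcard search might work, here is a small Python sketch that applies the same idea to a handful of sentences. It is only an illustration: the function names and sample sentences are mine, it treats ... as "one or more words" (my own simplification), and it has nothing to do with how Netspeak itself is built.

```python
import re

def pattern_to_regex(pattern):
    """Turn a query such as 'the unlikely ? that' into a regular expression.

    '?'   stands for exactly one word
    '...' stands for one or more words (a simplification for this sketch)
    """
    parts = []
    for token in pattern.split():
        if token == "?":
            parts.append(r"\w+")
        elif token == "...":
            parts.append(r"\w+(?: \w+)*")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\b" + " ".join(parts) + r"\b", re.IGNORECASE)

def find_matches(pattern, sentences):
    """Return every stretch of text in `sentences` that fits the pattern."""
    regex = pattern_to_regex(pattern)
    found = []
    for sentence in sentences:
        found.extend(m.group(0) for m in regex.finditer(sentence))
    return found

SENTENCES = [
    "In the unlikely event that it rains, we will meet indoors.",
    "It is highly unlikely that she will agree.",
    "This is most unlikely to happen.",
]

print(find_matches("the unlikely ? that", SENTENCES))
print(find_matches("? unlikely", SENTENCES))
```

The first search picks out the unlikely event that, and the second lists whatever word comes directly before unlikely in each sentence.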
Clicking on the + sign will bring up some example sentences, which seem to be from Google Search.
Tip - if the options disappear, click on the i sign
For collocations this is probably my favourite, perhaps because of the nice, clean interface, but also for its versatility. But I'm not so keen on its example sentences, which seem a bit hit-and-miss.


This is very easy to use and finds missing words, which you indicate with an asterisk (*). It is good for finding common modifiers and intensifiers, for example. Here are searches for: It was * unlikely and he has a * temper
Clicking on the words it finds gives quite a lot of information about those words, but we don't see examples of our new phrase in a greater context. It is 'powered by' Wordnik, an online dictionary, but I'm not sure where the examples come from.

Note - Oxford Collocations Dictionary

There are two online collocations dictionaries claiming to be connected with the Oxford Collocations Dictionary. Neither of them, as far as I know, has anything to do with Oxford publications. WOT (Web of Trust) gives one of them a red (warning) traffic-light, and has no information on the other, so I'm not going to endorse them by linking to them here.

Finding chunks - vocabulary in context

Once you've found words that collocate with your new word, it's a good idea to see how these combinations are used in context. We've already seen that For Better English gives an example sentence for each collocation, Just the Word gives us example sentences from the British National Corpus, and you can also see example sentences at Netspeak (probably from Google Search). But there are also other sites which can show us many more examples from corpora.

Fraze.It

Rather than show single collocates, Fraze.It shows the target word(s) in example sentences, mostly from newspapers. You can do quite a bit of tweaking, for example by specifying tenses, type of sentence (form - statement, question or negative), position at the beginning or end of the sentence (rule), UK or US (zone), and context - business, entertainment etc. Clicking on Advanced Search brings up various possible combinations.
It appears to work on single words, phrasal verbs and phrases. It also has a dictionary, synonyms, web definitions and a translator. Here is an example with wait for in the past continuous:

British National Corpus - Simple

The simple version of the BNC is very easy to use. Enter a word or phrase and it will show you examples of real usage from the corpus. Some of it is a bit literary or official sounding, and you have to be a little bit careful with some of the spoken examples, as the grammar is not always 100% standard.
The BNC gives you a random selection of up to fifty of the examples it finds. Sometimes it has fewer, and occasionally it simply draws a blank, as with the expression harsh but fair (a standard expression of which there are plenty of examples elsewhere). Here is the search result for bad temper - EG
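The "random fifty" idea is easy to picture with a few lines of Python. This is only a sketch under my own assumptions: the function name sample_examples and the three sentences are made up, and the real BNC search will certainly be doing something more sophisticated than a plain text match.

```python
import random

def sample_examples(phrase, sentences, limit=50, seed=None):
    """Return up to `limit` randomly chosen sentences containing `phrase`."""
    matches = [s for s in sentences if phrase.lower() in s.lower()]
    rng = random.Random(seed)  # pass a seed to get the same selection each time
    return rng.sample(matches, min(limit, len(matches)))

# Made-up sentences standing in for a corpus.
SENTENCES = [
    "He was known for his bad temper.",
    "Her bad temper got her into trouble.",
    "The weather was lovely all week.",
]

for example in sample_examples("bad temper", SENTENCES):
    print(example)
```

If there are fewer matches than the limit you simply get all of them, and if nothing matches you get an empty list, which is the "draws a blank" case.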

Google site search

This is probably what I use most for finding phrases in context. With Google site search, you can limit your search to one website, for example your favourite English-language newspaper, magazine or webpage. I've written more about this, and written a small tool to make Google site search even easier to use here.
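If you like to tinker, a site search is simply an ordinary Google search with site: added to the query, so the search address can be put together with a few lines of Python. This is just a sketch, not the tool I mentioned: the function name site_search_url is made up, and theguardian.com is only an example site.

```python
from urllib.parse import urlencode

def site_search_url(phrase, site):
    """Build a Google search URL restricted to one website.

    The phrase is put in double quotes so that it is searched for
    as an exact phrase, and `site:` limits the results to `site`.
    """
    query = f'"{phrase}" site:{site}'
    return "https://www.google.com/search?" + urlencode({"q": query})

print(site_search_url("harsh but fair", "theguardian.com"))
```

Paste the printed address into your browser and you should see only results from that one site.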

Other vocabulary tools

There are also a couple of other tools I use quite frequently.

Google Ngram Viewer

Ngram gives you graphs of how words and expressions are used over time. It's good for checking trends, comparing collocates and comparing British and American usage. It's based on a corpus of a proportion of the books that have been digitised at Google Books and is easy and fun to use. This graph shows how highly has become the most common intensifier with unlikely, but only quite recently:
Just enter the expressions, separated by a comma (but no space). EG. Scroll down the page and you'll find links to Google Books for various periods.
With a little tweaking we can compare British and American English in the same graph. This one shows how the description flight attendant took over from air hostess, a process which seems to have started rather earlier in the US than in the UK:
To do this, add one of these to the expressions you are looking up - :eng_us_2012 for American books and :eng_gb_2012 for British books. EG
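The same comparison can be set up from a short Python script that builds the Ngram Viewer address for you. Treat it as a sketch: the helper names tagged and ngram_url are mine, and the parameter names content, year_start and year_end are simply what I see in the addresses the Ngram Viewer produces, so they are an assumption and could change.

```python
from urllib.parse import urlencode

def tagged(expression, corpus):
    """Attach a corpus tag, e.g. 'air hostess' -> 'air hostess:eng_gb_2012'."""
    return f"{expression}:{corpus}"

def ngram_url(expressions, year_start=1950, year_end=2000):
    """Build an Ngram Viewer address for a list of expressions."""
    content = ",".join(expressions)  # separated by a comma, no space
    params = {"content": content, "year_start": year_start, "year_end": year_end}
    return "https://books.google.com/ngrams/graph?" + urlencode(params)

expressions = [
    tagged("flight attendant", "eng_us_2012"),  # American books
    tagged("flight attendant", "eng_gb_2012"),  # British books
    tagged("air hostess", "eng_us_2012"),
    tagged("air hostess", "eng_gb_2012"),
]
print(ngram_url(expressions))
```

The years here are just defaults I picked for the example; change them to whatever period you want to look at.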

Google Books

I sometimes prefer to look for expressions and check usage in Google Books rather than doing a standard Google search. Books have been edited and proofread, and so grammatical constructions and usages are probably a bit more reliable. Here are the search results for "Have you ever": EG

More Words

This is really a site for finding and checking words for Scrabble® and other word games, but it's also good for finding words beginning and ending with certain combinations of letters, for example prefixes and suffixes.
The easiest way to access it is just to Google, for example, "words ending in dom", and click on More Words, which will be in the first three or so search results. If it gives you the choice of "by frequency" go for that, otherwise open the list and scroll down and click on "by how common the words are" and then you'll get them listed by frequency. EG

More advanced

These sites are probably more suitable for linguists and grammarians, but they aren't quite as complicated to use as they look at first sight, and seem to be open to anyone to use.
I haven't spent enough time on them, however, to say much about them and they're probably a bit beyond the scope of this post.

LexTutor Corpus Concordance

At first sight this website looks rather complicated, but it's possible to do quite simple corpus searches by ignoring all the options (apart from choosing your corpus - I use the combined British one).

BNC, COCA and others at Brigham Young University

You can do more advanced corpus searches at this site. Again it looks a bit complicated at first, but you don't have to use all the bells and whistles. They ask you to register after five or so searches, but it's free and there don't seem to be any problems.

For teachers

I first came across the lexical approach on Leoxicon after Leo had commented on one of my posts. Leo Selivan is a teacher and trainer with the British Council, and seemingly one of the leading advocates of the lexical approach. You can find lots of information about using collocation tools and corpora, together with links to various articles about the lexical approach at his website (links below).
Here I've been mainly concerned about how the lexical approach can be used in relation to learning vocabulary in context, but Leo has a lot more to say on how it can affect grammar teaching.



  • Leoxicon - a blog dedicated to the use of corpora in the teaching of English to foreign students.
  • The lexical approach - links on Leoxicon to relevant articles
  • Lexical Tools - short discussion on some of the tools I've mentioned, with links for teachers to suggested classroom activities.
  • The Guardian - article by Leo Selivan


Hairy Scot said...

Hi WW,

Would love to have a chat with you, either via chat media or email but reluctant to post my addresses on a public thread.
I am also on FB, Skype, Windows Live Messenger and Trillian.

Warsaw Will said...

Hi HS,
I wondered if it was you. The email address I use for this blog is will followed by a dot, followed by randomidea followed by an at-sign, followed by (sorry about that - it's to hopefully avoid automatic spam)