Google search results: treat with a large pinch of salt

When deciding whether a certain construction or phrase is natural English or not, it makes sense to check how it is actually used. One way of doing this is to use corpora, but they can be quite complicated for laymen like me to use.
And as we have this wonderful 'corpus' that is the Internet, it is tempting to use search result counts, as well as web-based tools like Ngram Viewer, to make comparative assessments as to how language is used.
What's more, it's not only non-specialists like me who do this; Google search counts are not infrequently used by professional linguists at the linguistics blog Language Log, for example. But it turns out we need to be very careful.

Google web search

Commenting on the linguistics website Sentence First, I included the phrase:
a new set of figures are released
and wondered if that should have been 'is released'. Recently I've been investigating verb agreement with group words like 'range', 'series' and 'number', so I checked with Google search, which came up with:
  • "a new set of figures are released" - 12,300
  • "a new set of figures is released" - 496
This seemed pretty conclusive, so I then checked it with the present perfect, and got a rather different result:
  • "a new set of figures have been released" - 8
  • "a new set of figures has been released" - 14,600
This seemed rather strange, especially as Google was showing only three pages of results for all those 14,600 hits. So I went to the next page, where the count went down to a mere 18, and the page count down to 2. So a ratio of roughly 2/1 rather than 1825/1. Then I checked my first set:

"a new set of figures are released"

This started at 12,300 with a page count of five, again rather suspicious. By the time I got to the fourth page, that count had been reduced to 39, and the page count to four.

"a new set of figures is released"

This started at 496, with a page count of 8, and ended up at 36 on only four pages.

Revised figures

So we actually end up with the following:
  • "a new set of figures are released" - 39
  • "a new set of figures is released" - 36
  • "a new set of figures have been released" - 8
  • "a new set of figures has been released" - 39
Not quite what I had started off with.
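To put a number on the difference, here is a minimal Python sketch that compares the 'are released' / 'is released' ratio before and after paging through to the end of the results. The counts are the ones quoted above; the function name `ratio` and the dictionary layout are only illustrative choices, not part of any Google tool.

```python
def ratio(count_a: int, count_b: int) -> float:
    """Return how many times more frequent variant A appears than variant B."""
    if count_b == 0:
        raise ValueError("count_b must be non-zero")
    return count_a / count_b


# Counts for the "are released" / "is released" pair quoted in this post.
initial = {"are": 12_300, "is": 496}  # estimate shown on the first results page
final = {"are": 39, "is": 36}         # count shown on the last results page

initial_ratio = ratio(initial["are"], initial["is"])
final_ratio = ratio(final["are"], final["is"])

print(f"initial ratio are/is: {initial_ratio:.1f} to 1")
print(f"final ratio are/is:   {final_ratio:.2f} to 1")
```

On the initial estimates 'are' looks about 25 times more common than 'is'; on the final counts the two are practically level.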

Google Books

I often prefer to check with Google Books, as these are edited and proofread, and should give a better idea of 'correct' Standard English, but something similar happens here, too.
I entered the phrase "than had been previously thought", and got an initial count of 26,200. On Page 18, it was still 26,200, but by the next (and last) page this had miraculously come down to 183!

Google site search

This is useful for checking with certain newspapers etc. But we apparently have the same problem. Again I checked the expression "than had been previously thought".
Website               Initial count   Final count   Initial number of pages   Final number of pages
The BBC               42              17            4                         2
The Guardian          23              16            2                         2
The Telegraph         19              17            2                         2
The New York Times    49              27            4                         3
A site search of the New York Times for "than previously thought" starts off at 6160 and ends up at 491. At the Guardian, on the other hand, site results for the same phrase start at 1550 and end up pretty close at 1540.
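The contrast between those two sites is easier to see as a percentage. The following is a rough Python sketch, using only the two pairs of figures just quoted; `surviving_share` is a made-up helper name for illustration.

```python
def surviving_share(initial: int, final: int) -> float:
    """Fraction of the initially reported count still reported on the last page."""
    if initial <= 0:
        raise ValueError("initial must be positive")
    return final / initial


# (initial count, final count) for the site searches of "than previously thought"
site_counts = {
    "The New York Times": (6160, 491),
    "The Guardian": (1550, 1540),
}

for site, (initial, final) in site_counts.items():
    share = surviving_share(initial, final)
    print(f"{site}: {share:.0%} of the initial count remains")
```

For the New York Times only about 8% of the initial figure is left at the end, whereas the Guardian figure barely moves.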

Case study

This all came about from reading and commenting on a post at the linguistics blog Sentence First, where the blogger, Stan Carey, was suggesting that the phrase "than previously thought" was rather overused, especially in science publications, and recommended doing a site search of for "previously thought" to see what he meant.
The initial result for that is 6960. By the time I got to the final page (Page 58) it was still showing 6940, which seemed impossible, as each page only shows 10 results. And sure enough, at the bottom, there was the comment:
In order to show you the most relevant results, we have omitted some entries very similar to the 580 already displayed. If you like, you can repeat the search with the omitted results included.
Which I did, giving me another two pages, i.e. 60 pages in all, and it was still showing 6940, more than ten times the 600 or so results that 60 pages can actually hold.
So I went back to that Guardian count for "than previously thought", and sure enough the same thing had happened. At the top of the final page (54) it was showing 1550, but at the bottom 531. Repeating the process 'with the omitted results included' made no difference: 54 pages at ten results per page, yet still showing 1550 at the top.
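The sanity check used here is simple arithmetic: the number of pages times the results per page is the most that can actually be displayed, and the reported count can be compared against that. A small Python sketch, assuming ten results per page as observed above (the names `max_displayable` and `overstatement` are hypothetical helpers, not anything Google provides):

```python
RESULTS_PER_PAGE = 10  # as observed in this post; other page sizes are possible


def max_displayable(pages: int, per_page: int = RESULTS_PER_PAGE) -> int:
    """Upper bound on the results that can be shown across `pages` pages."""
    return pages * per_page


def overstatement(reported: int, pages: int, per_page: int = RESULTS_PER_PAGE) -> float:
    """How many times larger the reported count is than what can be displayed."""
    return reported / max_displayable(pages, per_page)


# (count reported at the top of the last page, number of pages reached)
cases = {
    '"previously thought", omitted results included': (6940, 60),
    'Guardian, "than previously thought"': (1550, 54),
}

for label, (reported, pages) in cases.items():
    shown = max_displayable(pages)
    factor = overstatement(reported, pages)
    print(f"{label}: reported {reported}, at most {shown} shown, factor {factor:.1f}")
```

In the first case the reported count is between eleven and twelve times what the pages can hold; in the Guardian case it is nearly three times.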
I don't know quite what all this proves, except that Google search result counts should be treated with a large pinch of salt. For a technical explanation, have a look at the links for NTL World and Wikipedia.

