I don’t know how Google does it, but I can say how I did it when the client needs something like this:
It starts with an inverted index, as described in Avi. This is just a list of tables for each word in each document, a document identifier, a word and a score for the relevance of a word in that document. (Another approach is to index each type of word individually along with its position, but in this case it is not required).
From there it is even simpler than describing Avi - there is no need to do a separate search for each term. Aggregate operations with basic databases can easily be done in one go:
SELECT document_id, sum(score) total_score, count(score) matches FROM rev_index WHERE word IN ('david', 'john') GROUP BY document_id HAVING matches = 2 ORDER BY total_score DESC
This will return the identifiers of all documents that have ratings for both “David” and “John” (that is, both words appear), ordered with a certain approximation of relevance, and will take about the same time to complete regardless of whether how many or how few terms you are looking for, since the performance of IN not very dependent on the size of the target set, and it uses a simple count to determine if all terms were consistent or not.
Note that this simplified method simply adds a rating of “David” and “John” to determine overall relevance; he does not accept order / proximity / etc. names. Once again, I'm sure Google is doing this in its ratings, but my client did not need it.
source share