Availability of a list with English words (including frequencies)?

I am using Python to parse URLs into words. I have had some success, but I'm trying to reduce the ambiguity. For example, given the following URL

"abbeycarsuk.com" 

and my algorithm produces:

 ['abbey','car','suk'],['abbey','cars','uk'] 

It's clear that the second parsing is correct, but the first is also technically valid (apparently, "suk" is a word in the dictionary I use).

It would help me a lot if there were a list of words that also contains the frequency/popularity of each word. I could use this in my algorithm, and the second parsing would then be chosen (since "uk" is clearly more common than "suk"). Does anyone know where I can find such a list? I found wordfrequency.info, but they charge for the data, and the free sample they offer does not have enough words for me to use it successfully.
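To illustrate how a frequency list would resolve the ambiguity, here is a minimal sketch. The counts in `FREQ` are made-up placeholder numbers standing in for whatever frequency list is eventually used; the idea is just to rank candidate segmentations by the sum of log-probabilities of their words:

```python
from math import log

# Hypothetical unigram counts -- placeholders for a real frequency list.
FREQ = {"abbey": 9000, "car": 120000, "cars": 40000, "uk": 300000, "suk": 50}
TOTAL = sum(FREQ.values())

def score(words):
    """Sum of log-probabilities of the words; higher is better."""
    return sum(log(FREQ.get(w, 1) / TOTAL) for w in words)

candidates = [["abbey", "car", "suk"], ["abbey", "cars", "uk"]]
best = max(candidates, key=score)
print(best)  # ['abbey', 'cars', 'uk']
```

Summing log-probabilities rather than multiplying raw probabilities avoids floating-point underflow on longer segmentations.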

Alternatively, I suppose I could download a large corpus (Project Gutenberg?) and compute the frequency values myself, but if such a data set already exists, it would make my life easier.

+4
source share
4 answers

There is an extensive article on this subject written by Peter Norvig (director of research at Google) that contains working examples in Python and is pretty easy to understand. The article, along with the data used in the sample programs (some excerpts from Google ngram data), can be found here. A complete set of Google ngrams for several languages can be found here (free to download if you live in the eastern US).
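A condensed sketch of the segmentation approach described in that article: recursively try every split point, score each candidate by unigram probability, and memoize. The counts below are toy placeholders (the article derives them from Google ngram data), and the length-based penalty for unseen words follows the article's style:

```python
from functools import lru_cache
from math import log

# Toy unigram counts -- stand-ins for real ngram-derived frequencies.
COUNTS = {"abbey": 9000, "car": 120000, "cars": 40000, "uk": 300000, "suk": 50}
TOTAL = float(sum(COUNTS.values()))

def pword(word):
    """Word probability; unseen words are penalized by their length."""
    if word in COUNTS:
        return COUNTS[word] / TOTAL
    return 10.0 / (TOTAL * 10 ** len(word))

@lru_cache(maxsize=None)
def segment(text):
    """Return the most probable segmentation of text as a tuple of words."""
    if not text:
        return ()
    splits = [(text[:i], text[i:]) for i in range(1, len(text) + 1)]
    candidates = [(first,) + segment(rest) for first, rest in splits]
    return max(candidates, key=lambda ws: sum(log(pword(w)) for w in ws))

print(segment("abbeycarsuk"))  # ('abbey', 'cars', 'uk')
```

With a real frequency table this scales to arbitrary concatenated strings; memoization keeps the recursion linear in the number of distinct suffixes.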

+7
source

As you mentioned, "corpus" is a search keyword.

E.g., here is a good list of resources:

http://www-nlp.stanford.edu/links/statnlp.html

(scroll down)

+2
source

Try using a smaller dictionary first; a smaller dictionary tends to contain only the more frequently used words. Then, if that fails, you could fall back to a larger, more comprehensive dictionary that includes words such as "suk".

You could then skip the word frequency analysis entirely, though you would add some overhead by maintaining the extra, smaller dictionary.
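The two-tier idea above can be sketched as follows. Both word sets here are hypothetical stand-ins for a real small and large dictionary; the ranking simply prefers segmentations that needed fewer words from the fallback dictionary:

```python
# Hypothetical small (common-word) and large (comprehensive) dictionaries.
COMMON = {"abbey", "car", "cars", "uk"}
EXTENDED = COMMON | {"suk", "bey", "ab"}

def fallback_count(segmentation):
    """Lower is better: how many words required the extended dictionary."""
    return sum(1 for w in segmentation if w not in COMMON)

candidates = [["abbey", "car", "suk"], ["abbey", "cars", "uk"]]
best = min(candidates, key=fallback_count)
print(best)  # ['abbey', 'cars', 'uk']
```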

You may be able to use the list linked in the comments as the small dictionary.

Edit: you also mentioned a free service where you can download a list of the top 5,000 most-used words.

+1
source

http://ucrel.lancs.ac.uk/bncfreq/flists.html

This is possibly the list you want. You could probably trim it down to improve performance if necessary.

Here is a longer list. More information here.

+1
source

Source: https://habr.com/ru/post/1491474/
