I am using Python to split URLs into words. I have had some success, but I'm trying to reduce ambiguity. For example, given the following URL
"abbeycarsuk.com"
and my algorithm produces:
['abbey','car','suk'],['abbey','cars','uk']
It's clear that the second parsing is correct, but the first is also technically valid (apparently "suk" is a word in the dictionary I use).
It would help me a lot if there were a word list that also contains the frequency/popularity of each word. I could use this in my algorithm, and the second parsing would then be chosen (since "uk" is clearly more common than "suk"). Does anyone know where I can find such a list? I found wordfrequency.info, but they charge for the data, and the free sample they offer does not contain enough words for me to use it successfully.
Alternatively, I suppose I could download a large corpus (Project Gutenberg?) and compute the frequency counts myself; however, if such a dataset already exists, it would make my life easier.
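To illustrate what I mean by using frequencies to disambiguate, here is a minimal sketch. The `freq` dictionary is a tiny hypothetical sample standing in for a real frequency list; ranking by the sum of log-frequencies prefers parses made of common words:

```python
import math

# Hypothetical word -> count data; a real frequency list would supply
# actual values for every dictionary word.
freq = {"abbey": 120_000, "car": 2_500_000, "cars": 900_000,
        "uk": 5_000_000, "suk": 300}

def score(parse):
    # Sum of log-frequencies; unseen words fall back to a count of 1
    # so a single rare word doesn't zero out the whole parse.
    return sum(math.log(freq.get(word, 1)) for word in parse)

candidates = [["abbey", "car", "suk"], ["abbey", "cars", "uk"]]
best = max(candidates, key=score)
print(best)  # ['abbey', 'cars', 'uk']
```

Summing log-frequencies rather than raw counts treats the parse roughly like a product of independent word probabilities, so one very common word cannot outweigh several rare ones.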