Comparing a string list with an available dictionary / thesaurus

I have a C# program that generates a list of strings (permutations of an original string). Most of them are random groupings of the source letters, as expected (e.g. etam, aemt, team). I want to programmatically find the one string in the list that is an actual English word. I need a thesaurus or dictionary to look up and compare each string against. Does anyone know of an available resource? I am using VS2008 with C#.

2 answers

You can download a word list from the Internet (say, one of the files mentioned here: http://www.outpost9.com/files/WordLists.html ), then build an index keyed on each word's sorted letters so that lookups are fast:

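    // Requires: using System; using System.Collections.Generic;

    // Read words from file (ReadFromFile is a placeholder for your own loader).
    string[] words = ReadFromFile();

    // Map each sorted-letter key to every word made of exactly those letters.
    Dictionary<string, List<string>> permuteDict =
        new Dictionary<string, List<string>>(StringComparer.OrdinalIgnoreCase);

    foreach (string word in words)
    {
        // Array.Sort sorts in place and returns void, so sort a copy first.
        // Lower-casing keeps "Team" and "team" under the same key.
        char[] letters = word.ToLowerInvariant().ToCharArray();
        Array.Sort(letters);
        string sortedWord = new string(letters);

        if (!permuteDict.ContainsKey(sortedWord))
        {
            permuteDict[sortedWord] = new List<string>();
        }
        permuteDict[sortedWord].Add(word);
    }

    // To do a lookup, sort the letters of the input the same way.
    char[] lookupLetters = wordToLook.ToLowerInvariant().ToCharArray();
    Array.Sort(lookupLetters);
    string sortedWordToLook = new string(lookupLetters);

    List<string> outWords;
    if (permuteDict.TryGetValue(sortedWordToLook, out outWords))
    {
        foreach (string outWord in outWords)
        {
            Console.WriteLine(outWord);
        }
    }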
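The snippet above leaves ReadFromFile undefined. A minimal sketch of it is shown here, assuming the downloaded word list is a plain text file with one word per line. The class name WordListLoader and the path parameter are made up for illustration; the original call takes no arguments.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class WordListLoader
{
    // Hypothetical helper: assumes one word per line in a plain text file.
    // Blank lines are skipped and surrounding whitespace is trimmed.
    public static string[] ReadFromFile(string path)
    {
        List<string> words = new List<string>();
        foreach (string line in File.ReadAllLines(path))
        {
            string word = line.Trim();
            if (word.Length > 0)
            {
                words.Add(word);
            }
        }
        return words.ToArray();
    }
}
```

File.ReadAllLines loads the whole file into memory, which should be fine for typical word lists; a StreamReader loop would be the alternative for very large files.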

You can also use Wiktionary. The MediaWiki API (Wiktionary runs on MediaWiki) lets you query a list of article titles. In Wiktionary, article titles are (among other things) the words in the dictionary. The only catch is that foreign words are in the dictionary too, so you can sometimes get "wrong" matches. Of course, your user will also need Internet access. You can get help and information about the API at: http://en.wiktionary.org/w/api.php

Here is an example request URL:

 http://en.wiktionary.org/w/api.php?action=query&format=xml&titles=dog|god|ogd|odg|gdo 

This returns the following XML:

    <?xml version="1.0"?>
    <api>
      <query>
        <pages>
          <page ns="0" title="ogd" missing=""/>
          <page ns="0" title="odg" missing=""/>
          <page ns="0" title="gdo" missing=""/>
          <page pageid="24" ns="0" title="dog"/>
          <page pageid="5015" ns="0" title="god"/>
        </pages>
      </query>
    </api>

In C#, you can use System.Xml.XPath to pick out the parts you need (the pages that have a pageid attribute). These are the "real words."

I wrote an implementation and tested it (using the simple "dog" example above). It returned only "dog" and "god". You should test it more thoroughly.

    public static IEnumerable<string> FilterRealWords(IEnumerable<string> testWords)
    {
        string baseUrl = "http://en.wiktionary.org/w/api.php?action=query&format=xml&titles=";
        string queryUrl = baseUrl + string.Join("|", testWords.ToArray());

        WebClient client = new WebClient();
        client.Encoding = UnicodeEncoding.UTF8; // this is very important or the text will be junk
        string rawXml = client.DownloadString(queryUrl);

        TextReader reader = new StringReader(rawXml);
        XPathDocument doc = new XPathDocument(reader);
        XPathNavigator nav = doc.CreateNavigator();
        XPathNodeIterator iter = nav.Select(@"//page");

        List<string> realWords = new List<string>();
        while (iter.MoveNext())
        {
            // if the pageid attribute has a value
            // add the article title to the list.
            if (!string.IsNullOrEmpty(iter.Current.GetAttribute("pageid", "")))
            {
                realWords.Add(iter.Current.GetAttribute("title", ""));
            }
        }
        return realWords;
    }

Call it as follows:

    IEnumerable<string> input = new string[] { "dog", "god", "ogd", "odg", "gdo" };
    IEnumerable<string> output = FilterRealWords(input);
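One thing to watch with longer inputs: as far as I know, the MediaWiki API caps the number of titles per request at 50 for ordinary clients, and a five-letter word already has 120 permutations. A rough sketch of splitting the candidates into batches follows. TitleBatcher and Batch are made-up names, and the batch size of 50 is an assumption you should check against the API help page.

```csharp
using System;
using System.Collections.Generic;

public static class TitleBatcher
{
    // Splits the candidate words into consecutive groups of at most batchSize items.
    public static List<List<string>> Batch(IEnumerable<string> words, int batchSize)
    {
        if (batchSize < 1)
        {
            throw new ArgumentOutOfRangeException("batchSize");
        }

        List<List<string>> batches = new List<List<string>>();
        List<string> current = new List<string>();
        foreach (string word in words)
        {
            current.Add(word);
            if (current.Count == batchSize)
            {
                batches.Add(current);
                current = new List<string>();
            }
        }

        // Keep the final, partially filled batch if there is one.
        if (current.Count > 0)
        {
            batches.Add(current);
        }
        return batches;
    }
}
```

Each batch can then be passed to FilterRealWords in turn, with the results collected into one list.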

I tried using LINQ to XML, but I am not familiar with it, so it was a pain, and I abandoned it.
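For anyone who prefers LINQ to XML, the same filter might look roughly like the sketch below. It only covers the parsing step, taking the XML returned by the API as a string, and has not been run against the live service. WiktionaryXml and ExistingTitles are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class WiktionaryXml
{
    // Returns the title of every <page> element that carries a pageid attribute,
    // i.e. the titles for which an article exists.
    public static List<string> ExistingTitles(string rawXml)
    {
        XDocument doc = XDocument.Parse(rawXml);
        return doc.Descendants("page")
                  .Where(p => p.Attribute("pageid") != null)
                  .Select(p => (string)p.Attribute("title"))
                  .ToList();
    }
}
```

Pages marked missing="" have no pageid attribute, so the Where clause drops them.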


Source: https://habr.com/ru/post/1301057/

