Problem with representing Bag of Words

Basically, I have a dictionary whose keys are all the words of my vocabulary, each with a value of 0.

To turn a document into its bag-of-words representation, I copied this dictionary with the appropriate IEqualityComparer, then checked for each word in the document whether the dictionary contains it and, if so, incremented that key's value.

To get the bag-of-words array, I simply used the ToArray method.

Everything seemed to work fine, but I was told that a dictionary does not guarantee the order of its keys, so the resulting arrays might list the words in different orders, which would make them useless.

My current idea is to copy all the dictionary keys into an ArrayList, create an array of the required size, and then use the list's IndexOf method to find the position in the array to fill in for each word.

So my question is: is there a better way to solve this? My approach seems crude to me. And will I run into problems because of the IEqualityComparer?

+3
6 answers

Let me see if I understand the problem. You have two documents, D1 and D2, each containing a sequence of words drawn from a known vocabulary {W1, W2 ... Wn}. You want two maps giving the number of occurrences of each word in each document. So for D1 you might have

W1 --> 0
W2 --> 1
W3 --> 4

if D1 is, say, "W3 W2 W3 W3 W3". And if D2 is "W2 W1 W2", you would have

W1 --> 1
W2 --> 2
W3 --> 0

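You then want to turn these maps into the vectors [0, 1, 4] and [1, 2, 0], so that the same position always refers to the same word.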

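If that is the case, the solution is to order the key/value pairs by key, so that both maps produce their counts in the same order: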

vector1 = (from pair in map1 orderby pair.Key select pair.Value).ToArray();
vector2 = (from pair in map2 orderby pair.Key select pair.Value).ToArray();
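
As a minimal, self-contained sketch of the ordering idea above (the sample maps, the `ToVector` helper, and the choice of `StringComparer.Ordinal` are assumptions made for illustration, and the code is written as C# 9 top-level statements):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample data: the two count maps from the example,
// deliberately inserted in different orders.
var map1 = new Dictionary<string, int> { ["W3"] = 4, ["W1"] = 0, ["W2"] = 1 };
var map2 = new Dictionary<string, int> { ["W2"] = 2, ["W3"] = 0, ["W1"] = 1 };

// Sort by key with an explicit ordinal comparer so the resulting order
// does not depend on insertion order or on the current culture.
int[] ToVector(Dictionary<string, int> map) =>
    map.OrderBy(pair => pair.Key, StringComparer.Ordinal)
       .Select(pair => pair.Value)
       .ToArray();

int[] vector1 = ToVector(map1);
int[] vector2 = ToVector(map2);

Console.WriteLine(string.Join(", ", vector1)); // 0, 1, 4
Console.WriteLine(string.Join(", ", vector2)); // 1, 2, 0
```

This only lines up across documents when every map contains the same set of keys, which holds here because each map starts as a copy of the full vocabulary.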


+4

You can use a Regex to extract the words from the input:

var words=Regex
    .Matches(input,@"\w+")
    .Cast<Match>()
    .Where(m=>m.Success)
    .Select(m=>m.Value);

Then group them to get the frequency of each word:

var map=words.GroupBy(w=>w).Select(g=>new{word=g.Key,frequency=g.Count()});

GroupBy has an overload that takes an IEqualityComparer, if you need one.

To get just the frequencies:

map.Select(a=>a.frequency)
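
A grouping like this only contains the words that actually occur in the input, so two documents would still yield vectors of different lengths and layouts. One possible way to combine it with a fixed vocabulary is sketched below; the vocabulary contents, the `ToBagOfWords` helper, and the case-insensitive comparer are all assumptions for illustration (written as C# 9 top-level statements):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical fixed vocabulary; its order defines the vector layout.
string[] vocabulary = { "apple", "banana", "cherry" };

int[] ToBagOfWords(string input)
{
    // Count each word, ignoring case (the comparer is an assumption).
    Dictionary<string, int> counts = Regex.Matches(input, @"\w+")
        .Cast<Match>()
        .Select(m => m.Value)
        .GroupBy(w => w, StringComparer.OrdinalIgnoreCase)
        .ToDictionary(g => g.Key, g => g.Count(), StringComparer.OrdinalIgnoreCase);

    // Walk the vocabulary in its fixed order; absent words count as zero.
    return vocabulary
        .Select(w => counts.TryGetValue(w, out int n) ? n : 0)
        .ToArray();
}

Console.WriteLine(string.Join(", ", ToBagOfWords("Banana apple banana"))); // 1, 2, 0
```

Words in the input that are not part of the vocabulary are simply ignored in this sketch.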


+2
+1

Here is a class that counts the words; call GetWordCount() to get the array of counts.

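// Requires: using System.Collections.Generic; using System.Linq;
public class WordCounter
{
    // Running word counts, keyed by lower-cased word.
    private Dictionary<string, int> dictionary = new Dictionary<string, int>();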

    public void CountWords(string text)
    {
        if (text != null && text != string.Empty)
        {
            text = text.ToLower();
            string[] words = text.Split(' ');
            if (dictionary.ContainsKey(words[0]))
            {
                if (text.Length > words[0].Length)
                {
                    text = text.Substring(words[0].Length + 1);
                    CountWords(text);
                }

            }
            else
            {
                int count = words.Count(
                    delegate(string s)
                    {
                        if (s == words[0]) { return true; }
                        else { return false; }
                    });
                dictionary.Add(words[0], count);
                if (text.Length > words[0].Length)
                {
                    text = text.Substring(words[0].Length + 1);
                    CountWords(text);
                }

            }
        }
    }

    public int[] GetWordCount(string text)
    { 
        CountWords(text);
        return dictionary.Values.ToArray<int>();
    }


}
0

You could use a SortedDictionary, which enumerates its keys in sorted order:

SortedDictionary<string, int> dic = new SortedDictionary<string, int>();

for (int i = 0; i < 10; i++)
{
    string word = "Word" + i;
    if (dic.ContainsKey(word))
        dic[word]++;
    else
        dic.Add(word, 1); // the first occurrence counts as 1
}

// to get the array of words (in sorted key order):
List<string> wordsList = new List<string>(dic.Keys);
string[] wordsArr = wordsList.ToArray();

// to get the array of values (in the same order as the keys):
List<int> valuesList = new List<int>(dic.Values);
int[] valuesArr = valuesList.ToArray();
0

If all you are trying to do is compute cosine similarity, you do not need to convert your data into arrays of 20,000 elements, especially since the data is likely to be sparse, with most entries being zero.

While processing each file, store its counts in a dictionary keyed by word. Then, to compute the dot product and the magnitudes, iterate over the words in the full word list, look each word up in each file's dictionary, and use the value found if it exists, or zero if it does not.
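
A rough sketch of that idea follows; the `CosineSimilarity` helper and the sample dictionaries are hypothetical (written as C# 9 top-level statements). It iterates over the words of one document rather than the full word list, which gives the same dot product because any word missing from either document contributes zero.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

double CosineSimilarity(Dictionary<string, int> a, Dictionary<string, int> b)
{
    // Only words present in both documents contribute to the dot product.
    double dot = 0;
    foreach (var pair in a)
    {
        if (b.TryGetValue(pair.Key, out int other))
            dot += (double)pair.Value * other;
    }

    // Magnitude of each sparse vector; missing words count as zero.
    double normA = Math.Sqrt(a.Values.Sum(v => (double)v * v));
    double normB = Math.Sqrt(b.Values.Sum(v => (double)v * v));

    // An empty document has no direction; returning 0 is an assumed convention.
    return (normA == 0 || normB == 0) ? 0 : dot / (normA * normB);
}

// Sparse counts for "W3 W2 W3 W3 W3" and "W2 W1 W2"; zero entries are omitted.
var d1 = new Dictionary<string, int> { ["W2"] = 1, ["W3"] = 4 };
var d2 = new Dictionary<string, int> { ["W1"] = 1, ["W2"] = 2 };

Console.WriteLine(CosineSimilarity(d1, d2)); // 2 / sqrt(85), roughly 0.217
```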

0

Source: https://habr.com/ru/post/1735529/

