Word distribution problem

I have a large word file (~100 GB) and only 4 GB of memory. I need to calculate the distribution of the words in this file. One option is to split it into chunks, sort each chunk, and then merge the chunks to compute the word distribution. Is there any other way to do this faster? One idea is to use a trie, but I'm not sure how to implement it so that it returns the correct result.

Thanks.

+3
7 answers

You can build a trie in which each leaf (and some inner nodes) holds a counter. Since words share prefixes with each other, 4 GB should be enough to process 100 GB of data.
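A minimal sketch of that idea in Python (the nested-dict layout and the "#count" marker key are illustrative choices, not part of the answer):

def add_word(trie, word):
    # Walk or extend the path for this word; shared prefixes reuse nodes.
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    # Child keys are single characters, so "#count" cannot collide.
    node["#count"] = node.get("#count", 0) + 1

trie = {}
for w in ("banana", "band", "banana"):
    add_word(trie, w)
# The shared prefix "ban" is stored once; banana has count 2, band has 1.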

+3

Naively, I would just build up a hash table until it hits a certain memory limit, then sort it in memory and write out that chunk. Finally, you can do an n-way merge of the chunks. At worst you will have about 100/4 = 25 chunks, but probably far fewer, provided some words are more common than others (and depending on how they cluster).
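A rough sketch of this spill-and-merge scheme in Python; the chunk limit, file names, and tab-separated chunk format are assumptions for illustration:

import heapq
from collections import Counter

CHUNK_LIMIT = 1_000_000  # distinct words to hold in memory before spilling

def spill_chunks(words, limit=CHUNK_LIMIT):
    # Count in a hash table; when it gets too big, write a sorted chunk file.
    counts, paths = Counter(), []

    def flush():
        path = "chunk_%d.tsv" % len(paths)
        with open(path, "w") as f:
            for w in sorted(counts):
                f.write("%s\t%d\n" % (w, counts[w]))
        paths.append(path)
        counts.clear()

    for w in words:
        counts[w] += 1
        if len(counts) >= limit:
            flush()
    if counts:
        flush()
    return paths

def read_chunk(path):
    with open(path) as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            yield word, int(count)

def merge_counts(paths):
    # heapq.merge streams the n-way merge of the sorted chunk files.
    current, total = None, 0
    for word, count in heapq.merge(*(read_chunk(p) for p in paths)):
        if word == current:
            total += count
        else:
            if current is not None:
                yield current, total
            current, total = word, count
    if current is not None:
        yield current, total

heapq.merge keeps just one pending line per chunk in memory, so the final pass streams no matter how large the merged result is.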

- A trie would also work here, though a naive 256-way tree with a child slot per byte can cost a lot of memory per node.

+2

, "trie" :

public class Trie : Dictionary<char, Trie>
{
    public int Frequency { get; set; }

    public void Add(string word)
    {
        this.Add(word.ToCharArray());
    }

    private void Add(char[] chars)
    {
        if (chars == null || chars.Length == 0)
        {
            throw new System.ArgumentException();
        }

        var first = chars[0];
        if (!this.ContainsKey(first))
        {
            this.Add(first, new Trie());
        }

        if (chars.Length == 1)
        {
            this[first].Frequency += 1;
        }
        else
        {
            this[first].Add(chars.Skip(1).ToArray());
        }
    }

    public int GetFrequency(string word)
    {
        return this.GetFrequency(word.ToCharArray());
    }

    private int GetFrequency(char[] chars)
    {
        if (chars == null || chars.Length == 0)
        {
            throw new System.ArgumentException();
        }

        var first = chars[0];
        if (!this.ContainsKey(first))
        {
            return 0;
        }

        if (chars.Length == 1)
        {
            return this[first].Frequency;
        }
        else
        {
            return this[first].GetFrequency(chars.Skip(1).ToArray());
        }
    }
}

Usage:

var t = new Trie();

t.Add("Apple");
t.Add("Banana");
t.Add("Cherry");
t.Add("Banana");

var a = t.GetFrequency("Apple"); // == 1
var b = t.GetFrequency("Banana"); // == 2
var c = t.GetFrequency("Cherry"); // == 1

It is a reasonably well-performing trie, too.

, , " ". , , trie , .

+2

Do you need the exact distribution, or would an approximation do? If an estimate is enough (i.e. a sample), you could process only a random fraction of the file, which would be much faster.
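If an approximate answer is acceptable, a sampling pass might look like this (a Python sketch; the sample rate is arbitrary, and the accuracy of the estimate depends on it, especially for rare words):

import random
from collections import Counter

def sample_distribution(path, rate=0.01, seed=0):
    # Estimate relative word frequencies from a random fraction of lines.
    rng = random.Random(seed)
    counts = Counter()
    with open(path) as f:
        for line in f:
            if rng.random() < rate:
                counts.update(line.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}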

0

You could use a DBM-style key/value store. It keeps the data on disk, typically in a B+ tree or a hash table, so the counts never have to fit in memory.
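In Python, for example, the standard library's dbm module is one such store; a minimal sketch (the database file name is arbitrary):

import dbm

def count_words(path, db_path="word_counts.db"):
    # Counts live on disk in the dbm file, so RAM usage stays small.
    with dbm.open(db_path, "c") as db:
        with open(path) as f:
            for line in f:
                for word in line.lower().split():
                    key = word.encode("utf-8")
                    if key in db:
                        db[key] = str(int(db[key]) + 1).encode("utf-8")
                    else:
                        db[key] = b"1"

This is slower per word than an in-memory table, but it never exceeds the memory budget.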

0

Why not use any SQL database? The algorithm could be (a sketch follows the list):

  • Create a table with columns word and count.
  • Loop over every word in the file (f.e. in Progress, or whatever you use).
  • SELECT the word from the table.
  • If it exists, increment its count.
  • Otherwise, add it to the table.
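A sketch of the above in Python with the stdlib sqlite3 module (table and file names are illustrative; the UPSERT form requires SQLite 3.24+ and folds the SELECT / increment / insert steps into one statement):

import sqlite3

def count_words_db(path, db_path="counts.sqlite"):
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS counts (word TEXT PRIMARY KEY, n INTEGER)")
    with open(path) as f:
        for line in f:
            for word in line.lower().split():
                # One UPSERT replaces the SELECT / UPDATE / INSERT sequence.
                con.execute(
                    "INSERT INTO counts VALUES (?, 1) "
                    "ON CONFLICT(word) DO UPDATE SET n = n + 1",
                    (word,))
    con.commit()
    con.close()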
0

If you use Python, you can make your reader class iterable (define __iter__ as a generator). It will read your file line by line and will not cause memory problems: you do not "return" values, you "yield" them. Here is an example I used to read a file and get vector values.

def __iter__(self):
    # Stream the file lazily, one line at a time; self.dictionary is
    # assumed to be a gensim Dictionary, whose doc2bow turns a token
    # list into a bag-of-words vector.
    for line in open(self.temp_file_name):
        yield self.dictionary.doc2bow(line.lower().split())
0
