Word distribution problem

I have a large word file (~100 GB) and only 4 GB of memory. I need to calculate the distribution of the words in this file. One option is to split it into chunks, sort each chunk, and then merge the chunks to compute the word distribution. Is there any other way to do this faster? One idea is to use a trie, but I'm not sure how to implement it so that it returns the correct result.

Thanks.

+3
7 answers

You can build a trie in which each leaf (and some inner nodes) holds a counter. Since words share prefixes with each other, 4 GB should be enough to process 100 GB of data.
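A minimal sketch of that idea in Python (the nested-dict layout and the "#count" marker key are illustrative choices, not part of the answer):

def add_word(trie, word):
    # Walk or extend the path for this word; shared prefixes reuse nodes.
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    # Child keys are single characters, so "#count" cannot collide.
    node["#count"] = node.get("#count", 0) + 1

trie = {}
for w in ("banana", "band", "banana"):
    add_word(trie, w)
# The shared prefix "ban" is stored once; banana has count 2, band has 1.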

+3

Naively, I would just build up a hash table until it hits a certain memory limit, then sort it in memory and write out that chunk. Finally, you can do an n-way merge of the chunks. At worst you will have about 100/4 = 25 chunks, but probably far fewer, provided some words are more common than others (and depending on how they cluster).
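A rough sketch of this spill-and-merge scheme in Python; the chunk limit, file names, and tab-separated chunk format are assumptions for illustration:

import heapq
from collections import Counter

CHUNK_LIMIT = 1_000_000  # distinct words to hold in memory before spilling

def spill_chunks(words, limit=CHUNK_LIMIT):
    # Count in a hash table; when it gets too big, write a sorted chunk file.
    counts, paths = Counter(), []

    def flush():
        path = "chunk_%d.tsv" % len(paths)
        with open(path, "w") as f:
            for w in sorted(counts):
                f.write("%s\t%d\n" % (w, counts[w]))
        paths.append(path)
        counts.clear()

    for w in words:
        counts[w] += 1
        if len(counts) >= limit:
            flush()
    if counts:
        flush()
    return paths

def read_chunk(path):
    with open(path) as f:
        for line in f:
            word, count = line.rstrip("\n").split("\t")
            yield word, int(count)

def merge_counts(paths):
    # heapq.merge streams the n-way merge of the sorted chunk files.
    current, total = None, 0
    for word, count in heapq.merge(*(read_chunk(p) for p in paths)):
        if word == current:
            total += count
        else:
            if current is not None:
                yield current, total
            current, total = word, count
    if current is not None:
        yield current, total

heapq.merge keeps just one pending line per chunk in memory, so the final pass streams no matter how large the merged result is.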

- A trie would also work here, though a naive 256-way tree with a child slot per byte can cost a lot of memory per node.

+2

, "trie" :

public class Trie : Dictionary<char, Trie>
{
    public int Frequency { get; set; }

    public void Add(string word)
    {
        this.Add(word.ToCharArray());
    }

    private void Add(char[] chars)
    {
        if (chars == null || chars.Length == 0)
        {
            throw new System.ArgumentException();
        }

        var first = chars[0];
        if (!this.ContainsKey(first))
        {
            this.Add(first, new Trie());
        }

        if (chars.Length == 1)
        {
            this[first].Frequency += 1;
        }
        else
        {
            this[first].Add(chars.Skip(1).ToArray());
        }
    }

    public int GetFrequency(string word)
    {
        return this.GetFrequency(word.ToCharArray());
    }

    private int GetFrequency(char[] chars)
    {
        if (chars == null || chars.Length == 0)
        {
            throw new System.ArgumentException();
        }

        var first = chars[0];
        if (!this.ContainsKey(first))
        {
            return 0;
        }

        if (chars.Length == 1)
        {
            return this[first].Frequency;
        }
        else
        {
            return this[first].GetFrequency(chars.Skip(1).ToArray());
        }
    }
}

Usage:

var t = new Trie();

t.Add("Apple");
t.Add("Banana");
t.Add("Cherry");
t.Add("Banana");

var a = t.GetFrequency("Apple"); // == 1
var b = t.GetFrequency("Banana"); // == 2
var c = t.GetFrequency("Cherry"); // == 1

It is a reasonably well-performing trie, too.

, , " ". , , trie , .

+2

Do you need the exact distribution, or would an approximation do? If an estimate is enough (i.e. a sample), you could process only a random fraction of the file, which would be much faster.
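If an approximate answer is acceptable, a sampling pass might look like this (a Python sketch; the sample rate is arbitrary, and the accuracy of the estimate depends on it, especially for rare words):

import random
from collections import Counter

def sample_distribution(path, rate=0.01, seed=0):
    # Estimate relative word frequencies from a random fraction of lines.
    rng = random.Random(seed)
    counts = Counter()
    with open(path) as f:
        for line in f:
            if rng.random() < rate:
                counts.update(line.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}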

0

You could use a DBM-style key/value store. It keeps the data on disk, typically in a B+ tree or a hash table, so the counts never have to fit in memory.
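In Python, for example, the standard library's dbm module is one such store; a minimal sketch (the database file name is arbitrary):

import dbm

def count_words(path, db_path="word_counts.db"):
    # Counts live on disk in the dbm file, so RAM usage stays small.
    with dbm.open(db_path, "c") as db:
        with open(path) as f:
            for line in f:
                for word in line.lower().split():
                    key = word.encode("utf-8")
                    if key in db:
                        db[key] = str(int(db[key]) + 1).encode("utf-8")
                    else:
                        db[key] = b"1"

This is slower per word than an in-memory table, but it never exceeds the memory budget.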

0

Why not use any SQL database? The algorithm could be (a sketch follows the list):

  • Create a table with columns word and count.
  • Loop over every word in the file (f.e. in Progress, or whatever you use).
  • SELECT the word from the table.
  • If it exists, increment its count.
  • Otherwise, add it to the table.
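A sketch of the above in Python with the stdlib sqlite3 module (table and file names are illustrative; the UPSERT form requires SQLite 3.24+ and folds the SELECT / increment / insert steps into one statement):

import sqlite3

def count_words_db(path, db_path="counts.sqlite"):
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS counts (word TEXT PRIMARY KEY, n INTEGER)")
    with open(path) as f:
        for line in f:
            for word in line.lower().split():
                # One UPSERT replaces the SELECT / UPDATE / INSERT sequence.
                con.execute(
                    "INSERT INTO counts VALUES (?, 1) "
                    "ON CONFLICT(word) DO UPDATE SET n = n + 1",
                    (word,))
    con.commit()
    con.close()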
0

If you use Python, you can make your reader class iterable (define __iter__ as a generator). It will read your file line by line and will not cause memory problems: you do not "return" values, you "yield" them. Here is an example I used to read a file and get vector values.

def __iter__(self):
    # Stream the file lazily, one line at a time; self.dictionary is
    # assumed to be a gensim Dictionary, whose doc2bow turns a token
    # list into a bag-of-words vector.
    for line in open(self.temp_file_name):
        yield self.dictionary.doc2bow(line.lower().split())
0
