Replace a long list of words in a large text file

I need a fast method to process a large text file

I have two files: a large text file (~20 GB) and another text file containing a list of ~12 million combo phrases (multi-word terms).

I want to find every combo phrase in the first file and replace it with the same phrase joined by underscores.

Example "Computer Information"> "Replace with>" Computer_Information "

I use the code below, but the performance is very poor (I am testing on an HP G7 server with 16 GB of RAM and a 16-core processor):

    public partial class Form1 : Form
    {
        HashSet<string> wordlist = new HashSet<string>();

        private void loadComboWords()
        {
            using (StreamReader ff = new StreamReader(txtComboWords.Text))
            {
                string line;
                while ((line = ff.ReadLine()) != null)
                {
                    wordlist.Add(line);
                }
            }
        }

        private void replacewords(ref string str)
        {
            foreach (string wd in wordlist)
            {
                // ReplaceEx(ref str, wd, wd.Replace(" ", "_"));
                if (str.IndexOf(wd) > -1)
                    str = str.Replace(wd, wd.Replace(" ", "_"));
            }
        }

        private void button3_Click(object sender, EventArgs e)
        {
            string line;
            using (StreamReader fread = new StreamReader(txtFirstFile.Text))
            {
                string writefile = Path.GetFullPath(txtFirstFile.Text) + Path.GetFileNameWithoutExtension(txtFirstFile.Text) + "_ReplaceComboWords.txt";
                StreamWriter sw = new StreamWriter(writefile);
                long intPercent;

                label3.Text = "initializing";
                loadComboWords();

                while ((line = fread.ReadLine()) != null)
                {
                    replacewords(ref line);
                    sw.WriteLine(line);

                    intPercent = (fread.BaseStream.Position * 100) / fread.BaseStream.Length;
                    Application.DoEvents();
                    label3.Text = intPercent.ToString();
                }

                sw.Close();
                fread.Close();
                label3.Text = "Finished";
            }
        }
    }

Any ideas on how to do this job in a reasonable amount of time?

thanks

2 answers

At first glance, the approach you have taken looks fine: it should work, and there is nothing obvious that would cause, for example, a lot of garbage.

The main issue, I think, is that you are only using one of those sixteen cores: nothing is dividing the load across the other fifteen.

I think the easiest way to do this is to split the large 20 GB file into sixteen chunks, process each of the chunks in parallel, and then combine them again. The extra time spent splitting and reassembling the file should be small compared to the roughly 16x gain from scanning the sixteen chunks in parallel.

In general terms, one way to do this could be:

    private List<string> SplitFileIntoChunks(string baseFile)
    {
        // Split the file into chunks, and return a list of the filenames.
    }

    private void AnalyseChunk(string filename)
    {
        // Analyses the file and performs replacements,
        // perhaps writing to the same filename with a different
        // file extension
    }

    private void CreateOutputFileFromChunks(string outputFile, List<string> splitFileNames)
    {
        // Combines the rewritten chunks created by AnalyseChunk back into
        // one large file, outputFile.
    }

    public void AnalyseFile(string inputFile, string outputFile)
    {
        List<string> splitFileNames = SplitFileIntoChunks(inputFile);

        var tasks = new List<Task>();
        foreach (string chunkName in splitFileNames)
        {
            var task = Task.Factory.StartNew(() => AnalyseChunk(chunkName));
            tasks.Add(task);
        }
        Task.WaitAll(tasks.ToArray());

        CreateOutputFileFromChunks(outputFile, splitFileNames);
    }
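Just to make the skeleton concrete, here is one possible sketch of SplitFileIntoChunks (an illustration, not a tested implementation): it reads the input line by line and starts a new chunk file once roughly 1/16 of the input has been written, so lines are never cut in half. The chunk count of 16 and the ".chunkN" file naming are assumptions.

    private List<string> SplitFileIntoChunks(string baseFile)
    {
        const int chunkCount = 16;                                  // assumption: one chunk per core
        long chunkSize = new FileInfo(baseFile).Length / chunkCount;
        var chunkNames = new List<string>();

        using (var reader = new StreamReader(baseFile))
        {
            StreamWriter writer = null;
            long written = 0;
            string line;

            while ((line = reader.ReadLine()) != null)
            {
                // start a new chunk file at the beginning, and whenever the current
                // chunk has grown past its share, but never split a line in two
                if (writer == null || (written >= chunkSize && chunkNames.Count < chunkCount))
                {
                    if (writer != null) writer.Dispose();
                    string chunkName = baseFile + ".chunk" + chunkNames.Count;   // illustrative naming
                    chunkNames.Add(chunkName);
                    writer = new StreamWriter(chunkName);
                    written = 0;
                }

                writer.WriteLine(line);
                written += line.Length + Environment.NewLine.Length;
            }

            if (writer != null) writer.Dispose();
        }

        return chunkNames;
    }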

One tiny nit: move the call that gets the stream length out of the loop; you only need to fetch it once.

EDIT: also incorporate @Pavel Gatilov's idea of inverting the logic of the inner loop: split each line into words and look each word up in the 12-million-entry list, instead of scanning the whole list for every line.


A few ideas:

  • I think it will be more efficient to split each line into words and check whether each of those few words appears in your word list: ten hash-set lookups are better than millions of substring searches. Since you have compound key phrases, build the appropriate indexes: one that contains all the individual words that occur in the real key phrases, and another that contains the real key phrases themselves.
  • Loading each line into a StringBuilder will probably make the replacements more efficient.
  • Update the progress indicator after, say, every 10,000 lines rather than after each one.
  • Do the processing on a background thread. It will not make the work itself faster, but the application will stay responsive (see the sketch after this list).
  • Parallelize the code as Jeremy suggested.
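
As a rough sketch (not part of the original answer) of the background-thread and throttled-progress points, here is how it might look with a BackgroundWorker. It assumes a BackgroundWorker field named "worker" with WorkerReportsProgress set to true and its events wired up; the output file naming is illustrative.

    private void button3_Click(object sender, EventArgs e)
    {
        label3.Text = "initializing";
        loadComboWords();
        worker.RunWorkerAsync(txtFirstFile.Text);
    }

    private void worker_DoWork(object sender, DoWorkEventArgs e)
    {
        string inputFile = (string)e.Argument;
        string outputFile = inputFile + "_ReplaceComboWords.txt";   // illustrative naming

        using (var reader = new StreamReader(inputFile))
        using (var writer = new StreamWriter(outputFile))
        {
            string line;
            long lineCount = 0;
            while ((line = reader.ReadLine()) != null)
            {
                replacewords(ref line);
                writer.WriteLine(line);

                // report progress only every 10,000 lines; as in the original code,
                // BaseStream.Position is only approximate because of buffering
                if (++lineCount % 10000 == 0)
                    worker.ReportProgress((int)(reader.BaseStream.Position * 100 / reader.BaseStream.Length));
            }
        }
    }

    private void worker_ProgressChanged(object sender, ProgressChangedEventArgs e)
    {
        label3.Text = e.ProgressPercentage.ToString();   // runs on the UI thread
    }

    private void worker_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
    {
        label3.Text = "Finished";
    }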

UPDATE

Here is some example code that demonstrates the idea of the single-word index:

    static void ReplaceWords()
    {
        string inputFileName = null;
        string outputFileName = null;

        // this dictionary maps each single word that can be found
        // in any keyphrase to a list of the keyphrases that contain it.
        IDictionary<string, IList<string>> singleWordMap = null;

        using (var source = new StreamReader(inputFileName))
        {
            using (var target = new StreamWriter(outputFileName))
            {
                string line;
                while ((line = source.ReadLine()) != null)
                {
                    // first, we split each line into single words - the unit of search
                    var singleWords = SplitIntoWords(line);

                    var result = new StringBuilder(line);
                    // for each single word in the line
                    foreach (var singleWord in singleWords)
                    {
                        // check if the word exists in any keyphrase we should replace
                        // and if so, get the list of the related original keyphrases
                        IList<string> interestingKeyPhrases;
                        if (!singleWordMap.TryGetValue(singleWord, out interestingKeyPhrases))
                            continue;

                        Debug.Assert(interestingKeyPhrases != null && interestingKeyPhrases.Count > 0);

                        // then process each of the keyphrases
                        foreach (var interestingKeyphrase in interestingKeyPhrases)
                        {
                            // and replace it in the processed line if it exists
                            result.Replace(interestingKeyphrase, GetTargetValue(interestingKeyphrase));
                        }
                    }

                    // now, save the processed line
                    target.WriteLine(result);
                }
            }
        }
    }

    private static string GetTargetValue(string interestingKeyword)
    {
        throw new NotImplementedException();
    }

    static IEnumerable<string> SplitIntoWords(string keyphrase)
    {
        throw new NotImplementedException();
    }

The code shows the main ideas:

  • We split both the key phrases and the processed lines into comparable units that can be matched efficiently: words.
  • We keep a dictionary that, for any given word, quickly yields all key phrases containing that word (a sketch of building it follows this list).
  • Then we apply your original replacement logic, but not to all 12 million key phrases: only to the very small subset that shares at least one word with the line being processed.
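
The sample leaves singleWordMap set to null; here is a minimal sketch (not part of the original answer) of how that index could be built, assuming the combo file holds one key phrase per line. BuildSingleWordMap is a made-up helper name.

    static IDictionary<string, IList<string>> BuildSingleWordMap(string comboFile)
    {
        var map = new Dictionary<string, IList<string>>();

        // assumption: one key phrase per line, e.g. "Computer Information"
        foreach (string keyPhrase in File.ReadLines(comboFile))
        {
            foreach (string word in SplitIntoWords(keyPhrase))
            {
                IList<string> phrases;
                if (!map.TryGetValue(word, out phrases))
                {
                    phrases = new List<string>();
                    map.Add(word, phrases);
                }
                phrases.Add(keyPhrase);
            }
        }

        return map;
    }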

I will leave the rest of the implementation to you.

There are several problems in the code:

  • SplitIntoWords should actually normalize words to a canonical form. How depends on the required logic: in the simplest case you will probably be fine with splitting on whitespace and normalizing case, but you may need morphological matching, which would be harder (this is very close to full-text-search problems). See the sketch after this list.
  • For speed, it would probably be better to call GetTargetValue once per key phrase up front, before processing the input, rather than on every match.
  • If many of your key phrases share words, you will still do a significant amount of extra work. In that case you will need to store the positions of the words within the key phrases, so that word-distance calculations can exclude irrelevant key phrases while processing an input line.
  • Also, I'm not sure whether StringBuilder is actually faster in this particular case. You should experiment with both StringBuilder and string to find out.
  • This is only a sample; the design is not great. I would consider extracting some classes with consistent interfaces (e.g. a KeywordsIndex).
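
As a rough illustration of the first two points, here is one way SplitIntoWords and a precomputed replacement table might look, assuming the simplest normalization (whitespace splitting plus lowercasing) and the underscore replacement from the question. BuildTargetValues is a hypothetical helper, not part of the sample.

    // simplest possible normalization: split on whitespace and lowercase
    static IEnumerable<string> SplitIntoWords(string text)
    {
        return text.ToLowerInvariant()
                   .Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // precompute the replacement value for every key phrase once, before
    // processing the input, instead of calling GetTargetValue on every match
    static IDictionary<string, string> BuildTargetValues(IEnumerable<string> keyPhrases)
    {
        var targets = new Dictionary<string, string>();
        foreach (string phrase in keyPhrases)
            targets[phrase] = phrase.Replace(" ", "_");   // "Computer Information" -> "Computer_Information"
        return targets;
    }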

Source: https://habr.com/ru/post/1387800/

