Reading a text file word by word

Question

I have a text file containing only lowercase letters and no punctuation except spaces. I would like to know the best way to read the file character by character, so that a space marks the end of one word and the beginning of the next. That is, each character read is appended to the current word; when the next character is a space, the word is passed to another method and reset, and this continues until the reader reaches the end of the file.

I am trying to do this with a StringReader, something like this:

    public String GetNextWord(StringReader reader)
    {
        String word = "";
        char c;
        do
        {
            c = Convert.ToChar(reader.Read());
            word += c;
        } while (c != ' ');
        return word;
    }

and call GetNextWord in a while loop until the end of the file is reached. Does this approach make sense, or are there better ways to achieve this?

+4

c#

Matt Mar 16 '12 at 15:58

9 answers

If you are interested in good performance even with very large files, take a look at the MemoryMappedFile class, which is new in .NET 4.0.

For instance:

    using (var mappedFile1 = MemoryMappedFile.CreateFromFile(filePath))
    {
        using (Stream mmStream = mappedFile1.CreateViewStream())
        {
            using (StreamReader sr = new StreamReader(mmStream, ASCIIEncoding.ASCII))
            {
                while (!sr.EndOfStream)
                {
                    var line = sr.ReadLine();
                    var lineWords = line.Split(' ');
                }
            }
        }
    }

From MSDN:

A memory-mapped file maps the contents of a file to an application's logical address space. Memory-mapped files enable programmers to work with extremely large files because memory can be managed concurrently, and they allow complete, random access to a file without the need for seeking. Memory-mapped files can also be shared across multiple processes.
The CreateFromFile methods create a memory-mapped file from a specified path or a FileStream of an existing file on disk. Changes are automatically propagated to disk when the file is unmapped.
The CreateNew methods create a memory-mapped file that is not mapped to an existing file on disk; they are suitable for creating shared memory for interprocess communication (IPC).
A memory-mapped file can be associated with an optional name that allows it to be shared with other processes.
You can create multiple views of the memory-mapped file, including views of parts of the file. You can map the same part of a file to more than one address to create concurrent memory. For two views to remain concurrent, they have to be created from the same memory-mapped file. Creating two file mappings of the same file with two views does not provide concurrency.

+6

Tim schmelter Mar 16 '12 at 16:28

First of all: StringReader reads from a string that is already in memory. This means that you have to load the entire input file before you can read from it, which defeats the purpose of reading a few characters at a time; it may also be undesirable or even impossible if the input is very large.

The class for reading from a text stream (which is an abstraction over a data source) is StreamReader, and you might want to use that instead. StreamReader and StringReader share the abstract base class TextReader, which means that if you code against TextReader, you can get the best of both worlds.

TextReader's public interface does support your sample code, so I would say that it is a reasonable starting point. You just need to fix one glaring bug: there is no check for Read returning -1 (which signals the end of the available data).
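A minimal sketch of what that fix might look like, written against TextReader so that it accepts either a StringReader or a StreamReader. The class name WordReader, the null return at end of input, and the choice to skip repeated whitespace are assumptions made for illustration; they are not part of the original question.

```csharp
using System;
using System.IO;
using System.Text;

static class WordReader
{
    // Hypothetical helper: returns the next word, or null when no input remains.
    public static string GetNextWord(TextReader reader)
    {
        var word = new StringBuilder();
        int next;
        // Read() returns -1 at end of input, so test the int before casting to char.
        while ((next = reader.Read()) != -1)
        {
            char c = (char)next;
            if (char.IsWhiteSpace(c))
            {
                if (word.Length > 0) break; // end of the current word
                continue;                   // skip leading or repeated separators
            }
            word.Append(c);
        }
        return word.Length > 0 ? word.ToString() : null;
    }

    static void Main()
    {
        // A StringReader stands in for a file here; a StreamReader is used the same way.
        using (TextReader reader = new StringReader("the quick  brown fox\n"))
        {
            string word;
            while ((word = GetNextWord(reader)) != null)
                Console.WriteLine(word);
        }
    }
}
```

Using a StringBuilder instead of repeated string concatenation avoids allocating a new string for every character, which may matter for long words or large inputs.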

+2

Jon Mar 16 '12 at 16:06

All on one line, here you go (assuming ASCII and maybe not a 2gb file):

 var file = File.ReadAllText(@"C:\myfile.txt", Encoding.ASCII).Split(new[] { ' ' });

Returns an array of strings that you can iterate over and do whatever you need.

+1

Bryan crosby Mar 16 '12 at 16:07

If you want to read the file without splitting whole lines, for example because a line is so long that you might hit an OutOfMemoryException, you should do it like this (using StreamReader):

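    // The method name and signature are assumed; the original showed only the body.
    static string ReadWord(StreamReader sr)
    {
        string word = "";
        char c;
        while (sr.Peek() >= 0)
        {
            c = (char)sr.Read();
            if (c.Equals(' ') || c.Equals('\t') || c.Equals('\n') || c.Equals('\r'))
            {
                break;
            }
            else
                word += c;
        }
        return word;
    }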

+1

Maticdiba Aug 28 '14 at 9:09

This method splits the text into words whether they are separated by a single space or by several (for example, two spaces):

    StreamReader streamReader = new StreamReader(filePath);      // get the file
    string stringWithMultipleSpaces = streamReader.ReadToEnd();  // load file to string
    streamReader.Close();

    Regex r = new Regex(" +");                                   // specify delimiter (spaces)
    string[] words = r.Split(stringWithMultipleSpaces);          // convert string to array of words

    foreach (String W in words)
    {
        MessageBox.Show(W);
    }

0

Andrew Mar 16 '12 at 16:08

I would do something like this:

    IEnumerable<string> ReadWords(StreamReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            foreach (string word in line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            {
                yield return word;
            }
        }
    }

If you use File.ReadAllText (or reader.ReadToEnd), the entire file is loaded into memory, so with a very large file you can get an OutOfMemoryException.
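The same line-at-a-time idea can also be sketched with File.ReadLines, which enumerates lines lazily, combined with LINQ's SelectMany. The class name LazyWords and the temporary file used in Main are assumptions added only so that the example runs on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class LazyWords
{
    // Hypothetical helper: streams words from a file one line at a time.
    public static IEnumerable<string> ReadWords(string path)
    {
        // File.ReadLines is lazy: a line is read only when the enumeration reaches it.
        return File.ReadLines(path)
                   .SelectMany(line => line.Split(new[] { ' ' },
                                                  StringSplitOptions.RemoveEmptyEntries));
    }

    static void Main()
    {
        // Write a small temporary file so the sketch is self-contained.
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, "one two  three\nfour\n");
            foreach (string word in ReadWords(path))
                Console.WriteLine(word);
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Only one line is held in memory at a time, so this still runs into trouble if the whole file is a single very long line.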

0

Eugene Mar 16 '12 at 16:21

I created a simple console program for your exact requirement, using the files you specified; it should be easy to run and verify. The code is below. Hope this helps.

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class Program
    {
        static void Main(string[] args)
        {
            string[] input = File.ReadAllLines(@"C:\Users\achikhale\Desktop\file.txt");
            string[] array1File = File.ReadAllLines(@"C:\Users\achikhale\Desktop\array1.txt");
            string[] array2File = File.ReadAllLines(@"C:\Users\achikhale\Desktop\array2.txt");

            List<string> finalResultarray1File = new List<string>();
            List<string> finalResultarray2File = new List<string>();

            // Collect the words of array1.txt that also occur in the input file.
            foreach (string inputstring in input)
            {
                string[] wordTemps = inputstring.Split(' ');
                foreach (string array1Filestring in array1File)
                {
                    string[] word1Temps = array1Filestring.Split(' ');
                    var result = word1Temps.Where(y => !string.IsNullOrEmpty(y) && wordTemps.Contains(y)).ToList();
                    if (result.Count > 0)
                    {
                        finalResultarray1File.AddRange(result);
                    }
                }
            }

            // Same check for array2.txt.
            foreach (string inputstring in input)
            {
                string[] wordTemps = inputstring.Split(' ');
                foreach (string array2Filestring in array2File)
                {
                    string[] word1Temps = array2Filestring.Split(' ');
                    var result = word1Temps.Where(y => !string.IsNullOrEmpty(y) && wordTemps.Contains(y)).ToList();
                    if (result.Count > 0)
                    {
                        finalResultarray2File.AddRange(result);
                    }
                }
            }

            if (finalResultarray1File.Count > 0)
            {
                Console.WriteLine("file array1.txt contains words: {0}", string.Join(";", finalResultarray1File));
            }
            if (finalResultarray2File.Count > 0)
            {
                Console.WriteLine("file array2.txt contains words: {0}", string.Join(";", finalResultarray2File));
            }

            Console.ReadLine();
        }
    }

0

Ankuser Sep 18 '17 at 7:19

This code extracts words from a text file based on a regex pattern. You can try playing with other patterns to see what works best for you.

    StreamReader reader = new StreamReader(fileName);

    var pattern = new Regex(
        @"( [^\W_\d]              # starting with a letter
                                  # followed by a run of either...
            ( [^\W_\d] |          #   more letters or
              [-'\d](?=[^\W_\d])  #   ', -, or digit followed by a letter
            )*
            [^\W_\d]              # and finishing with a letter
          )",
        RegexOptions.IgnorePatternWhitespace);

    string input = reader.ReadToEnd();

    foreach (Match m in pattern.Matches(input))
        Console.WriteLine("{0}", m.Groups[1].Value);

    reader.Close();

0

40-love Sep 25 '17 at 17:43

eouw0o83hf · Accepted Answer · 2012-03-16T16:02:36+0000

There is a much better way to do this: string.Split(). If you read the entire string in, C# can automatically split it on every space:

 string[] words = reader.ReadToEnd().Split(' ');

Now the words array contains all the words in the file, and you can do whatever you want with them.

In addition, you may want to look into the File.ReadAllText method in the System.IO namespace; it can make reading a file into a string much easier.

Edit: I guess this assumes that your file is not shockingly large; as long as the whole thing can reasonably be read into memory, this is the easiest way to go. If you have gigabytes of data to read, you will probably want to avoid it. Still, I would suggest using this approach if possible: it makes better use of the framework you have at your disposal.
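One caveat with Split(' '): consecutive spaces produce empty entries, and line breaks stay attached to the neighbouring words. A small sketch of a whitespace-aware variant follows; the class name SplitDemo and the sample text are assumptions for illustration. It relies on the documented behaviour that a null separator array makes Split treat all whitespace characters as delimiters.

```csharp
using System;

static class SplitDemo
{
    // Hypothetical helper: splits on any run of whitespace and drops empty entries.
    public static string[] SplitWords(string text)
    {
        // A null separator array makes Split treat every whitespace character as a delimiter.
        return text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    }

    static void Main()
    {
        string text = "alpha  beta\ngamma ";
        Console.WriteLine(text.Split(' ').Length);   // naive split on single spaces
        Console.WriteLine(SplitWords(text).Length);  // whitespace-aware split
    }
}
```

For the question's input (lowercase letters and spaces only) the difference matters mainly when the file contains line breaks or doubled spaces.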
