Get in the habit of setting the constant that you are testing on the left, as in
while (EOF! = (ch = getchar ()))
... since it will save you countless hours, just when you can afford to fail, when you accidentally dial = when you mean ==. Since you cannot assign a variable to a constant, the compiler will mark your error and save your butt.
In my experience, as soon as you get used to reading such code, you will find it much faster to find what is tested when it is next to if (, while (than hunting for it somewhere in the body of the test. This is especially true if you have a long list of tests, such as opening files, sockets, etc., and then allocating memory via malloc () to store the file data.
PS: after some research, there are a few basic things CS 101 are worth mentioning ...
Fourthly, here you have a classic case - in this case, because you have a requirement to look for one character, even on the first pass through the while () loop - for sowing the while () loop. The solution is to set the while () loop with a simple if () block, which performs a single pass using the same logic as the while () loop. (FYI, a while () is an infinite set of if () s with a termination condition)
The correct way to do this is as shown. A win can pull out all the if () testers, checking if this is the first pass through the while () loop each time through this loop. The 1st pass is processed by the if () tag, which precedes the while () loop.
2nd, I found your variable names uninformative. This does not mean that they were "wrong", but probably someone who is trying to maintain your code will fight too. In my experience, as you understand the code better, variable names are getting better and better. Use this as a litmus test, do you understand the problem, have a good solution and know why.
Thirdly, if you find that you initialize the variable in main () to 1, it should raise a flag in your mind about the correct flow control, since PassKnt is now set to 1. Also, as a rule, you want to increase the loop counter in the end of the / if / while loop, not at the beginning of it. Again, this should make you question your logic.
NOTE. Notepad by default saves in Unicode format. If you use Notepad to create test files for this program, be sure to save it in ANSI format.
I left it because it makes it easier to understand the program, but IsGraphFlg is not needed here. Instead of assigning IsGraphFlg to WasGraphFlg at the bottom of the loop, this can be done at the bottom of the upper and lower halves of the if-else block, since the contexts provide the same information as IsGraphFlag.
while (EOF != (ch = fgetc(pFile))) { if(isgraph(ch)) { IsGraphFlg=true; charcount++; } else { // this char is whitespace, last char was part of a word IsGraphFlg=false; if(WasGraphFlg) { wordcount++; } } WasGraphFlg = IsGraphFlg; PassKnt++; }
You may also notice that PassKnt now has no purpose and is no longer needed.
It was suggested that isgraph () is optimal, but when I created the bool array and initialized it using isgraph (), the code ran (from the memory buffer, which is ~ 10X as fast as from the file on this Dell XPS 8500) in total 2/3 times - 9.25 measures instead of 14.75 per character. This is a completely optional optimization - albeit a significant one.
bool IsGph[256]; for(i=0; i<sizeof(IsGph); i++) { IsGph[i] = isgraph((unsigned char)i); }
When using if (isgraph (i)) is replaced by if (IsGph [i]) in the loop of the main character and word.
Code updated 12/30/2012
// Word_Counter.cpp : Defines the entry point for the console application. // #include "stdafx.h" #include "stdafx.h" #include <stdio.h> #include <time.h> #include <stdlib.h> #include <memory.h> #include <locale> #define UCHAR unsigned char #define dbl double #define LLONG __int64 #define PROCESSOR_HZ ((LLONG) 3400000000) #pragma warning(disable : 4996) // // function prototypes FILE *OpenFiles (int *FileSz, char *FileName); // ----------------------------------------------------------------------- FILE *OpenFiles (int *FileSz, char *FileName) { FILE *pFile=NULL; if (NULL == (pFile = fopen ((char *)FileName, "r+t" ))) { printf ( "Can't open %s\n", FileName ); return NULL; } else { fseek(pFile,0,SEEK_END); *FileSz = ftell(pFile); rewind(pFile); printf("\nFile size is %i", *FileSz); return pFile; } } // ----------------------------------------------------------------------- int _tmain(int argc, char *argv[]) { bool IsGph[256]; UCHAR *p, *pBuff=NULL; int WrdKnt=0,CharKnt=0; int i, j, FileSz, LoopKnt=3500; time_t Etime=0,start=0, Eclocks=0; FILE *pFile=NULL; bool WasGraphFlg=false; // Initialize boolean array to detect printable characters for(i=0; i<sizeof(IsGph); i++) { IsGph[i] = isgraph((unsigned char)i); } if(NULL == (pFile = OpenFiles(&FileSz, (char *)argv[1]))) { return 0; } // --- Process out of buffer, not stdin ------------------------------- pBuff = (unsigned char *)calloc(FileSz, sizeof(char)); fread(pBuff, sizeof(char), FileSz, pFile); start = clock(); for(i=LoopKnt; i; i--) { p= pBuff; CharKnt=0; WrdKnt=0; for(j=FileSz; j; j--) { if(IsGph[*p++]) { CharKnt++; WasGraphFlg = true; } else { // this char is whitespace, and if(WasGraphFlg) { // last char was part of word ? WrdKnt++; } WasGraphFlg = false; } } } Etime = clock() - start; Eclocks= Etime * PROCESSOR_HZ/(LLONG) CLOCKS_PER_SEC; printf("\nElapsed time for %10i loops was %10i milliseconds", LoopKnt, Etime); printf("\nCPU cycles consumed per char were %2f\n", (dbl)Eclocks/(dbl)((LLONG)FileSz*(LLONG)LoopKnt)); printf("\n%i words counted per loop", WrdKnt); printf("\n%i chars counted per loop\n", CharKnt); getchar(); return _fcloseall(); free(pBuff); }
If you have problems with the command line argument with the file name, in the "Projects-> Properties-> Configuration Properties-> General" section in Visual Studio change "Unicode" to a badly mistaken "multibyte" character set. You can always look at argv [1] in the debugger to find out what is actually in argv [].