Testing software for word count program in character stream

I wrote a program that takes a stream of characters as input and counts the non-whitespace characters and the words. A word is defined as a run of characters separated by a space character. Here is the program.

    #include <stdio.h>
    #include <ctype.h>
    #include <stdbool.h>
    #include <iso646.h>

    int main(void)
    {
        unsigned long int wordcount = 0, charcount = 0, count = 1;
        int ch;
        bool flag, prev;

        while ((ch = getchar()) != EOF)
        {
            if (isgraph(ch))
                flag = true;
            else
                flag = false;
            if (flag)
                charcount++;
            if (count == 1)
                prev = flag;
            if (count != 1)
            {
                if (prev and (not flag))
                    wordcount++;
                prev = flag;
            }
            count++;
        }
        if ((ch == EOF) and flag)
            wordcount++;
        printf("\nnumber of words counted are %lu \n", wordcount);
        printf("\nnumber of characters counted are %lu \n", charcount);
        return 0;
    }

I have tested this program on simple sentences so far. But just for practice, I want to do detailed software testing on it. How can I do this? Do I just feed it more sentences? I tried giving it a few paragraphs from some of the novels I found on Project Gutenberg. What else can I do here? Also, can I increase the efficiency of this program?

+2
2 answers

There are various basic tests:

  • Empty file
  • File with one blank only
  • File with one non-blank only
  • File with one blank and one non-blank
  • File with one non-blank and one blank
  • File with multiple blanks only
  • File with multiple non-blanks only
  • File with multiple blanks followed by a non-blank

And so on ... this is boundary testing: make sure the code works correctly under boundary conditions.
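The boundary cases listed above can be turned into a small table-driven check. The sketch below is one possible way to do that, not the asker's code: count_text() and run_boundary_cases() are hypothetical helpers, count_text() applies the same isgraph()-based rule to a string in memory instead of stdin, and the expected counts were worked out by hand from the question's definition of a word.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: counts graphic characters and words in a string,
   using the same rule as the question (a word is a run of isgraph() chars). */
void count_text(const char *s, unsigned long *words, unsigned long *chars)
{
    bool in_word = false;

    *words = 0;
    *chars = 0;
    for (; *s != '\0'; s++) {
        if (isgraph((unsigned char)*s)) {
            (*chars)++;
            in_word = true;
        } else {
            if (in_word)        /* a blank right after a word ends that word */
                (*words)++;
            in_word = false;
        }
    }
    if (in_word)                /* input ended in the middle of a word */
        (*words)++;
}

struct boundary_case {
    const char *input;
    unsigned long words;
    unsigned long chars;
};

/* Runs every boundary case and returns the number of failures. */
int run_boundary_cases(void)
{
    static const struct boundary_case cases[] = {
        { "",        0, 0 },    /* empty input */
        { " ",       0, 0 },    /* one blank only */
        { "a",       1, 1 },    /* one non-blank only */
        { " a",      1, 1 },    /* one blank, then one non-blank */
        { "a ",      1, 1 },    /* one non-blank, then one blank */
        { "   ",     0, 0 },    /* multiple blanks only */
        { "abc",     1, 3 },    /* multiple non-blanks only */
        { "   a",    1, 1 },    /* multiple blanks followed by a non-blank */
        { "a\tb\nc", 3, 3 },    /* tab and newline also separate words */
    };
    int failures = 0;

    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        unsigned long w, c;

        count_text(cases[i].input, &w, &c);
        if (w != cases[i].words || c != cases[i].chars) {
            printf("case %zu failed: got %lu words, %lu chars\n", i, w, c);
            failures++;
        }
    }
    return failures;
}
```

Calling run_boundary_cases() from main() and returning its result as the exit status gives a quick pass/fail signal. New cases are added as one more row in the table.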

Your assignment of the value from getchar() to an unsigned long int (now fixed in the question) was unusual. Since the return value is non-negative for a regular character and negative (EOF) for end of file or error, it is normal to assign it to a plain signed int.
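To illustrate why the type matters, here is a minimal sketch; count_bytes() is a made-up name and it reads from any FILE * rather than stdin. Because ch is an int, EOF stays distinct from every valid byte value, including 0xFF.

```c
#include <stdio.h>

/* Counts every byte in the stream. ch is an int so that EOF (negative)
   can never be confused with a valid byte value such as 0xFF. */
unsigned long count_bytes(FILE *in)
{
    unsigned long n = 0;
    int ch;                         /* int, not char or unsigned char */

    while ((ch = getc(in)) != EOF)
        n++;
    return n;
}
```

If ch were a plain char on a platform where char is signed and EOF is -1, a 0xFF byte would compare equal to EOF and end the loop early. If it were an unsigned char, the comparison with EOF could never be true and the loop would not terminate.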

Your test ch == EOF after the loop is redundant; the only way out of the loop is when the condition is true.

Using <iso646.h> and its keywords (macros) "and" and "not" is also unusual.

More often than not, people don't put code on the same line as the opening brace of a block.

You can increment charcount in the if block where you set flag = true;. You can use an else block instead of if (count != 1). In fact, AFAICT, your code:

    if(count ==1)
        prev = flag;
    if(count != 1)
    {
        if(prev and (not flag))
            wordcount++;
        prev = flag;
    }

can be written as:

    if (count > 1 and prev and (not flag))
        wordcount++;
    prev = flag;

The description "number of characters counted" is not strictly accurate; it is the number of graphic (non-blank, non-control) characters that you report. This is probably at the hyper-nitpicking end of the fussy scale, though (along with the observation that "number of words" is a singular quantity, so it should be "is" rather than "are").

It's a little unusual to start count at 1 rather than at zero. It appears to record "one more than the number of raw characters read by the program", which is an unusual quantity to record. Typically, you would initialize it to 0 and modify the test I rewrote to read:

 if (count != 0 and prev and (not flag)) 

(You can use count != 0 or count > 0; for an unsigned value, the two conditions are equivalent.)

You may be able to simplify your conditional expressions further by initializing prev appropriately (possibly to false).
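A sketch of where these simplifications could lead, assuming prev is initialized to false: the pass counter disappears entirely, because "no previous character" and "previous character was not graphic" can be treated the same way. count_stream() is a hypothetical name, and the function reads from any FILE * rather than only stdin so that it is easier to test.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Same counting rule as the question, but prev starts as false, so the
   loop needs no pass counter and no first-pass special case. */
void count_stream(FILE *in, unsigned long *wordcount, unsigned long *charcount)
{
    bool prev = false;      /* previous char was graphic; false before any input */
    int ch;

    *wordcount = 0;
    *charcount = 0;
    while ((ch = getc(in)) != EOF) {
        bool flag = isgraph(ch) != 0;

        if (flag)
            (*charcount)++;
        if (prev && !flag)          /* a word has just ended */
            (*wordcount)++;
        prev = flag;
    }
    if (prev)                       /* input ended inside a word */
        (*wordcount)++;
}
```

Calling count_stream(stdin, &words, &chars) from main() should behave like the original program on non-empty input. On empty input it reports zero words without reading an uninitialized flag.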

+3

Get in the habit of putting the constant you are testing against on the left, as in

    while (EOF != (ch = getchar()))

... since it will save you countless hours, just when you can least afford a mistake, when you accidentally type = when you mean ==. Since you cannot assign to a constant, the compiler will flag your error and save your butt.

In my experience, once you get used to reading code like this, you will find it much quicker to see what is being tested when it sits right next to the if ( or while ( than to hunt for it somewhere in the body of the test. This is especially true when you have a long list of tests, such as opening files and sockets and then allocating memory via malloc() to hold the file data.

PS: after some study of the code, there are a few basic CS 101 things worth mentioning ...

1st, you have a classic case here for priming the while() loop, because you need to look at one character even on the first pass through the loop. The solution is to prime the while() loop with a simple if() block that performs a single pass using the same logic as the while() loop. (FYI, a while() is an unbounded series of if()s with a termination condition.)

The correct way to do this is as shown. The payoff is that you can pull out all the if() tests that check, on every pass, whether this is the first pass through the while() loop. The 1st pass is handled by the if() block that precedes the while() loop.
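A primed loop of the kind described here might look like the following sketch. It is an illustration only, not code from this answer: count_primed() and its variable names are made up, and it reads from a FILE * passed in by the caller.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>

/* Primed loop: the first character is handled by an if block before the
   while loop, so the loop body never asks "is this the first pass?". */
void count_primed(FILE *in, unsigned long *words, unsigned long *chars)
{
    int ch;

    *words = 0;
    *chars = 0;

    if ((ch = getc(in)) != EOF) {
        /* Priming pass: same logic as the loop body, minus the word-end
           test, because there is no previous character yet. */
        bool was_graph = isgraph(ch) != 0;

        if (was_graph)
            (*chars)++;

        while ((ch = getc(in)) != EOF) {
            bool is_graph = isgraph(ch) != 0;

            if (is_graph)
                (*chars)++;
            else if (was_graph)     /* blank right after a word: word ended */
                (*words)++;
            was_graph = is_graph;
        }
        if (was_graph)              /* input ended inside a word */
            (*words)++;
    }
}
```

The trade-off is a little duplicated logic in the priming block in exchange for a loop body with no first-pass bookkeeping.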

2nd, I found your variable names uninformative. This does not mean that they were "wrong", but someone trying to maintain your code will probably struggle with them too. In my experience, as you understand the code better, your variable names get better and better. Use this as a litmus test of whether you understand the problem, have a good solution, and know why.

3rd, if you find yourself initializing a variable in main() to 1, as PassKnt is now, it should raise a flag in your mind about whether your flow control is correct. Also, as a rule, you want to increment a loop counter at the end of the for/if/while loop, not at the beginning of it. Again, this should make you question your logic.

NOTE. Notepad by default saves in Unicode format. If you use Notepad to create test files for this program, be sure to save it in ANSI format.

I left it in because it makes the program easier to understand, but IsGraphFlg is not needed here. Instead of assigning IsGraphFlg to WasGraphFlg at the bottom of the loop, WasGraphFlg can be set at the bottom of each half of the if-else block, since the branch taken carries the same information as IsGraphFlg.

    while (EOF != (ch = fgetc(pFile))) {
        if(isgraph(ch)) {
            IsGraphFlg=true;
            charcount++;
        }
        else {
            // this char is whitespace, last char was part of a word
            IsGraphFlg=false;
            if(WasGraphFlg) {
                wordcount++;
            }
        }
        WasGraphFlg = IsGraphFlg;
        PassKnt++;
    }

You may also notice that PassKnt now has no purpose and is no longer needed.

It was suggested that isgraph() is optimal, but when I created a bool array and initialized it using isgraph(), the code ran in about 2/3 of the time: 9.25 clock cycles per character instead of 14.75. (This was measured from a memory buffer, which is ~10X as fast as reading from the file on this Dell XPS 8500.) This is a completely optional optimization, albeit a significant one.

    bool IsGph[256];
    for(i=0; i<sizeof(IsGph); i++) {
        IsGph[i] = isgraph((unsigned char)i);
    }

To use it, if (isgraph(ch)) is replaced by if (IsGph[ch]) in the main character-and-word counting loop.
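A portable sketch of the table-driven idea, assuming the input has already been read into a memory buffer. count_buffer() and is_gph are illustrative names, not the identifiers used in the full program.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Precompute isgraph() for all 256 byte values, then scan a memory buffer
   with one table lookup per byte instead of one isgraph() call per byte. */
void count_buffer(const unsigned char *buf, size_t len,
                  unsigned long *words, unsigned long *chars)
{
    bool is_gph[256];
    bool was_graph = false;

    for (int i = 0; i < 256; i++)
        is_gph[i] = isgraph(i) != 0;

    *words = 0;
    *chars = 0;
    for (size_t k = 0; k < len; k++) {
        if (is_gph[buf[k]]) {
            (*chars)++;
            was_graph = true;
        } else {
            if (was_graph)          /* blank right after a word: word ended */
                (*words)++;
            was_graph = false;
        }
    }
    if (was_graph)                  /* buffer ended inside a word */
        (*words)++;
}
```

Whether the lookup table is actually faster than calling isgraph() depends on the compiler and C library, so it is worth measuring on the target machine before adopting it.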

Code updated 12/30/2012

    // Word_Counter.cpp : Defines the entry point for the console application.
    //
    #include "stdafx.h"
    #include <stdio.h>
    #include <time.h>
    #include <stdlib.h>
    #include <memory.h>
    #include <ctype.h>      // isgraph()

    #define UCHAR unsigned char
    #define dbl double
    #define LLONG __int64
    #define PROCESSOR_HZ ((LLONG) 3400000000)

    #pragma warning(disable : 4996)

    // function prototypes
    FILE *OpenFiles (int *FileSz, char *FileName);

    // -----------------------------------------------------------------------
    FILE *OpenFiles (int *FileSz, char *FileName)
    {
        FILE *pFile=NULL;

        // binary mode, so the size from ftell() matches the bytes fread() returns
        if (NULL == (pFile = fopen (FileName, "rb"))) {
            printf ("Can't open %s\n", FileName);
            return NULL;
        }
        else {
            fseek(pFile,0,SEEK_END);
            *FileSz = ftell(pFile);
            rewind(pFile);
            printf("\nFile size is %i", *FileSz);
            return pFile;
        }
    }

    // -----------------------------------------------------------------------
    int _tmain(int argc, char *argv[])
    {
        bool IsGph[256];
        UCHAR *p, *pBuff=NULL;
        int WrdKnt=0,CharKnt=0;
        int i, j, FileSz, LoopKnt=3500;
        clock_t Etime=0,start=0;
        LLONG Eclocks=0;
        FILE *pFile=NULL;
        bool WasGraphFlg=false;

        // Initialize boolean array to detect printable characters
        for(i=0; i<sizeof(IsGph); i++) {
            IsGph[i] = isgraph((unsigned char)i);
        }

        if(argc < 2 || NULL == (pFile = OpenFiles(&FileSz, argv[1]))) {
            return 0;
        }

        // --- Process out of buffer, not stdin -------------------------------
        pBuff = (UCHAR *)calloc(FileSz, sizeof(char));
        if(NULL == pBuff) {
            return 0;
        }
        FileSz = (int)fread(pBuff, sizeof(char), FileSz, pFile);

        start = clock();
        for(i=LoopKnt; i; i--) {
            p = pBuff;
            CharKnt=0;
            WrdKnt=0;
            WasGraphFlg=false;              // each timing pass starts outside a word
            for(j=FileSz; j; j--) {
                if(IsGph[*p++]) {
                    CharKnt++;
                    WasGraphFlg = true;
                }
                else {                      // this char is whitespace, and
                    if(WasGraphFlg) {       // last char was part of word ?
                        WrdKnt++;
                    }
                    WasGraphFlg = false;
                }
            }
            if(WasGraphFlg) {               // buffer ended in the middle of a word
                WrdKnt++;
            }
        }
        Etime = clock() - start;
        Eclocks = (LLONG)Etime * PROCESSOR_HZ/(LLONG) CLOCKS_PER_SEC;

        printf("\nElapsed time for %10i loops was %10ld milliseconds", LoopKnt, (long)(Etime * 1000 / CLOCKS_PER_SEC));
        printf("\nCPU cycles consumed per char were %.2f\n", (dbl)Eclocks/(dbl)((LLONG)FileSz*(LLONG)LoopKnt));
        printf("\n%i words counted per loop", WrdKnt);
        printf("\n%i chars counted per loop\n", CharKnt);

        getchar();
        free(pBuff);                        // free before returning, not after
        return _fcloseall();
    }

If you have problems passing the file name as a command-line argument, go to "Project->Properties->Configuration Properties->General" in Visual Studio and change the character set from "Unicode" to the badly misnamed "multibyte" character set. You can always look at argv[1] in the debugger to find out what is actually in argv[].

0

Source: https://habr.com/ru/post/1341738/

