How to scrape a web page using C?

So, I wrote a website scraper in C# using the HTML Agility Pack. It was pretty straightforward. Even given the formatting inconsistencies on the web page, it only took me a couple of hours to get working.

Now I have to re-implement this program in C so that it can run in a Linux environment. This is a major nightmare.

I can pull down the page, but when it comes to parsing it to pull out the parts that interest me, I am drawing a lot of blanks. Initially I was dead set on implementing a solution similar to my HTML Agility Pack approach in C#, using Tidy and some other XML library, so that I could keep my logic more or less the same.

That didn't work out so well. The XML library I have access to does not have XPath support, and I cannot install one that does. So I resorted to trying to read the page and use string matching to find the data I need. I can't help feeling there must be a better way to do this.
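For what it is worth, this is roughly the sort of thing I was hoping to write if an XPath-capable library such as libxml2 had been available (just a sketch, untested here since I cannot install it; the file name and the CO1/CO2 filter come from my scraper):

    /* Sketch only: assumes libxml2's HTML parser and XPath support were available,
     * which is not the case in my environment.
     * Would compile with: gcc xpath_demo.c $(xml2-config --cflags --libs) */
    #include <stdio.h>
    #include <string.h>
    #include <libxml/HTMLparser.h>
    #include <libxml/xpath.h>

    int main(void)
    {
        /* Parse the saved page leniently; real-world HTML is rarely well formed. */
        htmlDocPtr doc = htmlReadFile("codes.html", NULL,
                                      HTML_PARSE_RECOVER | HTML_PARSE_NOERROR |
                                      HTML_PARSE_NOWARNING);
        if (!doc)
            return 1;

        xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
        /* Same idea as the HTML Agility Pack query: select every <strong> element. */
        xmlXPathObjectPtr res = xmlXPathEvalExpression(BAD_CAST "//strong", ctx);

        if (res && res->nodesetval) {
            for (int i = 0; i < res->nodesetval->nodeNr; i++) {
                xmlChar *text = xmlNodeGetContent(res->nodesetval->nodeTab[i]);
                if (text) {
                    /* Only keep the codes I care about (CO1... / CO2...). */
                    if (strncmp((const char *)text, "CO1", 3) == 0 ||
                        strncmp((const char *)text, "CO2", 3) == 0)
                        printf("Code: %.7s\n", (const char *)text);
                    xmlFree(text);
                }
            }
        }

        xmlXPathFreeObject(res);
        xmlXPathFreeContext(ctx);
        xmlFreeDoc(doc);
        xmlCleanupParser();
        return 0;
    }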

Here is what I have:

    #define HTML_PAGE "codes.html"

    /* trim(), struct _code_st, Data and ADD_TO_LIST are defined elsewhere in my code. */
    void got_code(char *html);

    int extract()
    {
        FILE *html;
        int found = 0;
        char buffer[1000];
        char searchFor[80], *cp;

        html = fopen(HTML_PAGE, "r");

        if (html)
        {
            // this is too error prone: if the buffer cuts off half way through
            // the string we are looking for, it will fail!
            while (fgets(buffer, 999, html))
            {
                trim(buffer);

                if (!found)
                {
                    sprintf(searchFor, "<strong>");
                    cp = strstr(buffer, searchFor);
                    if (!cp)
                        continue;

                    if (strncmp(cp + strlen(searchFor), "CO1", 3) == 0 ||
                        strncmp(cp + strlen(searchFor), "CO2", 3) == 0)
                    {
                        got_code(cp + strlen(searchFor));
                    }
                }
            }

            fclose(html);
        }

        return 0;
    }

    void got_code(char *html)
    {
        char code[8];
        char *endTag;
        struct _code_st *currCode;
        int i;

        endTag = strstr(html, "</strong>");
        if (!endTag)
            return;

        sprintf(code, "%.7s", html);

        /* skip codes we have already stored */
        for (i = 0; i < Data.Codes; i++)
            if (strcasecmp(Data.Code[i].Code, code) == 0)
                return;

        ADD_TO_LIST(currCode, _code_st, Data.Code, Data.Codes);
        currCode->Code = strdup(code);

        printf("Code: %s\n", code);
    }

The above does not work properly. I get a lot of the codes I am interested in, but as I mentioned above, if the buffer cuts off in the wrong place, I miss some.

I tried just reading the whole HTML fragment that interests me into a single string, but I could not figure out how to do it - I could not get the codes out.

Does anyone know how I can solve this problem?

EDIT: I have thought about this a little more. Is there a way I can look ahead in the file, search for the end of each "block" of text I am parsing, and set the buffer size to that before I read it in? Would I need another file pointer to the same file? This would (hopefully) prevent the buffer cutting off in awkward places.
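For example, something along these lines is what I have in mind - read the entire file into one heap buffer so that strstr() never has to deal with a match that straddles a read boundary (sketch only; read_whole_file is just a name I made up for the helper):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Read an entire file into a NUL-terminated heap buffer. */
    static char *read_whole_file(const char *path)
    {
        FILE *fp = fopen(path, "rb");
        if (!fp)
            return NULL;

        fseek(fp, 0, SEEK_END);
        long size = ftell(fp);
        rewind(fp);
        if (size < 0) {
            fclose(fp);
            return NULL;
        }

        char *buf = malloc((size_t)size + 1);
        if (buf) {
            size_t got = fread(buf, 1, (size_t)size, fp);
            buf[got] = '\0';   /* terminate so the string functions are safe */
        }
        fclose(fp);
        return buf;
    }

    int main(void)
    {
        char *page = read_whole_file("codes.html");
        if (!page)
            return 1;

        /* With the whole page in one buffer, a tag can never be split across
         * two reads, so a plain strstr() is enough to locate it. */
        char *first = strstr(page, "<strong>");
        if (first)
            printf("first code block starts at offset %ld\n", (long)(first - page));

        free(page);
        return 0;
    }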

1 answer

Well, after banging my head against the wall for a long time trying to come up with a way to make my code above work, I decided to try a slightly different approach.

Since I knew that the data on the page I am scraping all sits on one huge line, I changed my code to search through the file until it finds that line. Then I step along the line, looking for the blocks I want. This worked surprisingly well, and once I had the code reading some of the blocks, it was easy to make minor tweaks to cope with inconsistencies in the HTML. The part that took the longest was figuring out how to bail out once I reached the end of the line, and I solved that by looking ahead to make sure there is another block to read.
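To show the idea without all the bookkeeping, a stripped-down sketch of that scan looks something like this (illustrative only - my real code below also filters out duplicates via got_code()):

    #include <stdio.h>
    #include <string.h>

    /* Walk one long line of HTML and print every 7-character code that sits
     * inside a <strong>...</strong> pair and starts with CO1 or CO2. */
    static void scan_line(const char *line)
    {
        const char *p = line;

        while ((p = strstr(p, "<strong>")) != NULL) {
            p += strlen("<strong>");

            const char *end = strstr(p, "</strong>");
            if (!end)
                break;                      /* no closing tag left: end of the blocks */

            if (strncmp(p, "CO1", 3) == 0 || strncmp(p, "CO2", 3) == 0)
                printf("Code: %.7s\n", p);  /* codes are 7 characters long */

            p = end + strlen("</strong>");  /* jump past this block and keep scanning */
        }
    }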

Here is my code (which is ugly but functional):

    #define HTML_PAGE "codes.html"
    #define START_BLOCK "<strong>"
    #define END_BLOCK "</strong>"

    void got_code(char *html);

    int extract()
    {
        FILE *html;
        int found = 0;
        char *line = NULL, *endTag, *startTag;
        size_t len = 0;
        ssize_t read;
        char searchFor[80];

        html = fopen(HTML_PAGE, "r");

        if (html)
        {
            while ((read = getline(&line, &len, html)) != -1)
            {
                if (found)   // we are on the line that holds the codes we are interested in
                {
                    char *ptr = line;
                    size_t nlen = strlen(END_BLOCK);

                    while (ptr != NULL)
                    {
                        sprintf(searchFor, START_BLOCK);
                        startTag = strstr(ptr, searchFor);
                        if (!startTag)
                            break;   // no more start tags anywhere in the line, we are done

                        if (strncmp(startTag + strlen(searchFor), "CO1", 3) == 0 ||
                            strncmp(startTag + strlen(searchFor), "CO2", 3) == 0)
                        {
                            got_code(startTag + strlen(searchFor));
                        }
                        else
                        {
                            // not a code we care about, step forward and keep scanning
                            nlen = strlen(START_BLOCK);
                            ptr += nlen;
                            continue;
                        }

                        sprintf(searchFor, END_BLOCK);
                        ptr = strstr(ptr, searchFor);
                        if (!ptr)
                        {
                            found = 0;
                            break;
                        }

                        nlen = strlen(END_BLOCK);
                        ptr += nlen;

                        if (ptr)
                        {
                            // look ahead to make sure there is another block to pull out
                            sprintf(searchFor, END_BLOCK);
                            endTag = strstr(ptr, searchFor);
                            if (!endTag)
                                break;
                        }
                    }

                    found = 0;
                    break;
                }

                // find the section of the downloaded page we care about;
                // the next line we read will be one blob containing the html we want
                if (strstr(line, "wiki-content") != NULL)
                    found = 1;
            }

            fclose(html);
        }

        free(line);   // getline() allocated this buffer
        return 0;
    }

    void got_code(char *html)
    {
        char code[8];
        char *endTag;
        struct _code_st *currCode;
        int i;

        endTag = strstr(html, END_BLOCK);
        if (!endTag)
            return;

        sprintf(code, "%.7s", html);

        /* skip codes we have already stored */
        for (i = 0; i < Data.Codes; i++)
            if (strcasecmp(Data.Code[i].Code, code) == 0)
                return;

        ADD_TO_LIST(currCode, _code_st, Data.Code, Data.Codes);
        currCode->Code = strdup(code);

        printf("Code: %s\n", code);
    }

Not as elegant as my C# program, but at least it pulls back all the information I need.


Source: https://habr.com/ru/post/1204883/

