So, I wrote a website scraper in C# using the HTML Agility Pack. It was pretty straightforward. Even with the formatting inconsistencies on the web page, it only took me a couple of hours to get working.
Now I have to re-implement this program in C so that it can run in a Linux environment. This is a major nightmare.
I can pull the page down, but when it comes to parsing it to extract the parts that interest me, I'm drawing a blank. Initially I was dead set on implementing something similar to my HTML Agility Pack solution in C#, using Tidy plus some XML library so that I could keep my logic more or less the same.
That didn't work out so well. The XML library I have access to does not support XPath, and I cannot install one that does. So I resorted to reading the page line by line and using string matching to find the data I need. I can't help but feel there must be a better way to do this.
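For reference, the C equivalent I had in mind would look something like this if a proper HTML library were available. This assumes libxml2 (which I do not actually have on the box): its HTML parser tolerates broken markup, and it supports XPath. The file name and the <strong>/CO1/CO2 checks are the same ones my code below uses:

#include <stdio.h>
#include <string.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

int extract_with_libxml2(void)
{
    int i;

    /* parse the saved page; the HTML parser recovers from messy markup */
    htmlDocPtr doc = htmlReadFile("codes.html", NULL,
                                  HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (!doc) return -1;

    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    if (!ctx) { xmlFreeDoc(doc); return -1; }

    /* find every <strong> element, wherever it sits in the document */
    xmlXPathObjectPtr res = xmlXPathEvalExpression((const xmlChar *)"//strong", ctx);

    if (res && res->nodesetval)
    {
        for (i = 0; i < res->nodesetval->nodeNr; i++)
        {
            xmlChar *text = xmlNodeGetContent(res->nodesetval->nodeTab[i]);
            if (text && (strncmp((char *)text, "CO1", 3) == 0 ||
                         strncmp((char *)text, "CO2", 3) == 0))
                printf("Code: %.7s\n", (char *)text);
            xmlFree(text);
        }
    }

    xmlXPathFreeObject(res);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
    return 0;
}

It would build with something like gcc scrape.c $(xml2-config --cflags --libs), but since I cannot install anything, this is only here to show the kind of logic I am trying to reproduce.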
Here is what I have:
#include <stdio.h>
#include <string.h>
#include <strings.h>
#include <stdlib.h>

#define HTML_PAGE "codes.html"

void got_code(char *html);   /* trim(), Data, struct _code_st and ADD_TO_LIST are defined elsewhere */

int extract()
{
    FILE *html;
    int found = 0;
    char buffer[1000];
    char searchFor[80], *cp;

    html = fopen(HTML_PAGE, "r");

    if (html)
    {
        // this is too error prone: if the buffer cuts off half way through the string we are looking for, it will fail!
        while (fgets(buffer, 999, html))
        {
            trim(buffer);

            if (!found)
            {
                sprintf(searchFor, "<strong>");
                cp = strstr(buffer, searchFor);
                if (!cp) continue;

                if (strncmp(cp + strlen(searchFor), "CO1", 3) == 0 ||
                    strncmp(cp + strlen(searchFor), "CO2", 3) == 0)
                {
                    got_code(cp + strlen(searchFor));
                }
            }
        }
        fclose(html);   /* only close the file if it was actually opened */
    }

    return 0;
}

void got_code(char *html)
{
    char code[8];
    char *endTag;
    struct _code_st *currCode;
    int i;

    endTag = strstr(html, "</strong>");
    if (!endTag) return;

    sprintf(code, "%.7s", html);   /* the codes are assumed to be 7 characters long */

    /* skip codes we have already seen */
    for (i = 0; i < Data.Codes; i++)
        if (strcasecmp(Data.Code[i].Code, code) == 0)
            return;

    ADD_TO_LIST(currCode, _code_st, Data.Code, Data.Codes);
    currCode->Code = strdup(code);

    printf("Code: %s\n", code);
}
The above does not work properly. It finds a lot of the codes I'm interested in, but as the comment in the code says, if the buffer breaks in the wrong place, I miss some.
I also tried just reading the whole HTML fragment that interests me into one buffer, but I couldn't figure out how to get that working - it never printed any codes.
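To be concrete, this is roughly what I mean by reading the whole thing in at once: get the file size with fseek/ftell, malloc one buffer for the entire page, and then strstr over that so a match can never be split across reads. It reuses HTML_PAGE and got_code() from above; treat it as an untested sketch of the idea rather than something I know works:

/* read the whole file into one NUL-terminated buffer so <strong> tags
 * can never be split across two reads; caller must free() the result */
char *slurp_file(const char *path)
{
    FILE *fp = fopen(path, "rb");
    char *buf = NULL;
    long size;

    if (!fp) return NULL;

    if (fseek(fp, 0, SEEK_END) == 0 && (size = ftell(fp)) >= 0)
    {
        rewind(fp);
        buf = malloc(size + 1);
        if (buf)
        {
            size_t got = fread(buf, 1, size, fp);
            buf[got] = '\0';
        }
    }

    fclose(fp);
    return buf;
}

int extract(void)
{
    char *page = slurp_file(HTML_PAGE);
    char *cp = page;

    if (!page) return -1;

    /* scan the whole document for <strong> tags in one pass */
    while ((cp = strstr(cp, "<strong>")) != NULL)
    {
        cp += strlen("<strong>");
        if (strncmp(cp, "CO1", 3) == 0 || strncmp(cp, "CO2", 3) == 0)
            got_code(cp);
    }

    free(page);
    return 0;
}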
Does anyone know how I can solve this problem?
EDIT: I have thought about this a little more. Is there a way I can look ahead in the file and search for the end of each "block" of text I'm parsing, and set the buffer size to that before I read it? Would I need a second file pointer to the same file? That would (hopefully) prevent the buffer from cutting off in awkward places.
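In case it makes the edit clearer, this is the kind of look-ahead I am picturing: remember the current offset with ftell(), scan forward until the closing marker (I'm assuming "</strong>" closes each block, as in my code), then fseek() back and read exactly that many bytes into a buffer sized to fit. Again, just a sketch:

/* read one "block" of the file into a buffer sized to fit it;
 * end_marker closes the block (e.g. "</strong>"); returns a malloc'd,
 * NUL-terminated string (or NULL at end of file) that the caller frees.
 * Note: the simple matcher below assumes the marker's first character
 * does not recur inside the marker, which is true for "</strong>". */
char *read_block(FILE *fp, const char *end_marker)
{
    long start = ftell(fp);
    size_t matched = 0, len = 0;
    int ch;
    char *block;

    /* first pass: scan ahead until the end marker (or EOF) to measure the block */
    while ((ch = fgetc(fp)) != EOF)
    {
        len++;
        matched = (ch == end_marker[matched]) ? matched + 1
                                              : (ch == end_marker[0] ? 1 : 0);
        if (end_marker[matched] == '\0')
            break;
    }

    if (len == 0) return NULL;

    /* second pass: jump back and read exactly that many bytes */
    block = malloc(len + 1);
    if (!block) return NULL;

    fseek(fp, start, SEEK_SET);
    len = fread(block, 1, len, fp);
    block[len] = '\0';
    return block;
}

extract() would then call read_block(html, "</strong>") in a loop and run the same strstr/strncmp checks on each block before freeing it. Would that be a sane way to do it, or is there something more standard?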