I have read a lot of questions from different people about libraries for HTML parsing. I decided to go with htmlcxx because it looks simple and it is in the Ubuntu repository. While playing with htmlcxx, I tried to perform a simple task: grab the text between heading tags. Using the iterator, it->text() returns the opening tag itself, and it->closingText() returns the closing tag. My question is: how do I get the data BETWEEN the tags? There must be a way; why write a library for traversing HTML and leave this out? If someone can point me in the right direction, I would appreciate it.
You can check out what I have done so far with svn: svn co svn://yunices.dyndns.org/repository/nich/trunk
or browse it through websvn: https://yunices.dyndns.org/
Here is the specific fragment in question:
void node::get_headings()
{
    // Compiled once; matches exactly h1..h6 in either case.
    // (The original "[h|H][1-6]" also accepted a literal '|' and substrings.)
    static const boost::regex expression("^[hH][1-6]$");

    tree<htmlcxx::HTML::Node>::iterator it = dom.begin();
    tree<htmlcxx::HTML::Node>::iterator end = dom.end();
    for (; it != end; ++it) {
        // Skip text and comment nodes; only tag nodes have a tag name.
        if (!it->isTag() || !boost::regex_search(it->tagName(), expression)) {
            continue;
        }
        it->parseAttributes();
        // text() is the opening tag, closingText() is the closing tag.
        std::cout << it->text() << "<=>" << it->closingText() << std::endl;
        std::map<std::string, std::string> pairs = it->attributes();
        for (std::map<std::string, std::string>::const_iterator iter = pairs.begin();
             iter != pairs.end(); ++iter) {
            std::cout << iter->first << ":" << iter->second << "\n";
        }
    }
}
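For what it is worth, one possible way to get at the content between the tags is to slice it out of the original HTML string. The sketch below is only that, a sketch: `inner_html` is a hypothetical helper, not part of htmlcxx, and it assumes that `Node::offset()` gives the position of the opening tag in the parsed string and that `Node::length()` covers the whole element up to and including the closing tag. That should be checked against the installed htmlcxx version before relying on it. The helper itself is plain string arithmetic and does not need the library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, NOT part of htmlcxx.
//   html      : the exact string that was handed to the parser
//   offset    : start of the element in html (assumed: Node::offset())
//   length    : size of the whole element   (assumed: Node::length())
//   open_tag  : text of the opening tag     (assumed: Node::text())
//   close_tag : text of the closing tag     (assumed: Node::closingText())
// Returns everything between the opening and closing tag, nested markup included.
std::string inner_html(const std::string& html,
                       std::size_t offset,
                       std::size_t length,
                       const std::string& open_tag,
                       const std::string& close_tag)
{
    // Precondition: the element span lies inside the source string.
    assert(offset <= html.size() && length <= html.size() - offset);

    // Nothing lies between the tags if they fill (or overrun) the span.
    if (open_tag.size() + close_tag.size() >= length) {
        return std::string();
    }
    return html.substr(offset + open_tag.size(),
                       length - open_tag.size() - close_tag.size());
}
```

Inside the loop above this would be called roughly as inner_html(html, it->offset(), it->length(), it->text(), it->closingText()), where html is the string that was parsed. The result still contains any nested tags such as <b>; if only plain text is wanted, another option may be to walk the heading's child nodes and concatenate text() for those where neither isTag() nor isComment() is true.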