htmlcxx: crawling HTML in C++

I have read about many people's problems with libraries for crawling HTML. I decided to go with htmlcxx because it looks simple and is in the Ubuntu repository. While playing with htmlcxx, I tried a simple task: grabbing the text between heading tags. With an iterator, it->text() returns the opening tag itself and it->closingText() returns the closing tag. My question is: how do I get the data BETWEEN the tags? There must be a way; why create a library for traversing HTML without this feature? If someone can point me in the right direction, I would appreciate it.

You can check out what I have so far with svn: svn co svn://yunices.dyndns.org/repository/nich/trunk

or browse it through WebSVN: https://yunices.dyndns.org/

Here is the specific fragment in question:

    void node::get_headings()
    {
        // Match exactly h1..h6 in either case. Inside [...] a '|' is a literal
        // character, so the original "[h|H][1-6]" also accepted names like "|1".
        static const boost::regex expression("^[hH][1-6]$");

        tree<htmlcxx::HTML::Node>::iterator it = dom.begin();
        tree<htmlcxx::HTML::Node>::iterator end = dom.end();
        for (; it != end; ++it)
        {
            if (boost::regex_search(it->tagName(), expression))
            {
                it->parseAttributes();
                // Prints the opening tag and the closing tag, but not what lies between them
                std::cout << it->text() << "<=>" << it->closingText() << std::endl;

                std::map<std::string, std::string> pairs = it->attributes();
                for (std::map<std::string, std::string>::const_iterator iter = pairs.begin();
                     iter != pairs.end(); ++iter)
                {
                    std::cout << iter->first << ":" << iter->second << "\n";
                }
            }
        }
    }
3 answers

You can add these methods to Node.h to get the content between the tags (passing the original HTML string as an argument):

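    // Length of everything between the opening and closing tags
    inline unsigned int contentLength() const
    {
        return this->mLength - this->mText.length() - this->mClosingText.length();
    }

    // Slice of the original HTML between the opening and closing tags
    inline std::string content(const std::string& html) const
    {
        return html.substr(this->mOffset + this->mText.length(), this->contentLength());
    }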

This works well, Dave, thanks. There was actually a bracket missing, so I just combined it into a single method:

    inline std::string content(const std::string& html) const
    {
        return html.substr(this->mOffset + this->mText.length(),
                           this->mLength - (this->mText.length() + this->mClosingText.length()));
    }

In most DOM libraries (and therefore in htmlcxx, if I read the code correctly), the text inside a tag is itself a node (or, in the case of something like

<p> bla <p>blubb</p> blah </p>

more than one node).

You just need to iterate over all the children of the tag and keep those that are neither comments nor tags.


The following function demonstrates a method for accessing child content.

    std::string get_child_content(tree<HTML::Node> const & dom,
                                  tree<HTML::Node>::iterator const & parent)
    {
        std::string result;
        for (unsigned i = 0; i < dom.number_of_children(parent); i++)
        {
            tree<HTML::Node>::iterator it = dom.child(parent, i);
            // Keep only text nodes: skip child tags and comments
            if (!it->isTag() && !it->isComment())
                result += it->text();
        }
        return result;
    }

Keep in mind, however, that, as @filmor noted, HTML can nest children several levels deep under any tag. The function I provided only captures the direct children.

Here is an example of how you can use this, and its effect on some sample HTML ...

    cout << it->text();               // display the opening tag
    cout << get_child_content(dom,it); // display the contents
    cout << it->closingText();        // display the closing tag

Raw HTML ...

 <h2>hello <span>w</span>orld</h2> 

Result (note that the <span> and its contents are missing) ...

 <h2>hello orld</h2> 

Source: https://habr.com/ru/post/1340822/

