iOS: HTML parsing - how to ignore tags like a, li, etc. inside <p>
I am currently using Hpple to parse HTML, for example:
    TFHpple *htmlParser = [TFHpple hppleWithHTMLData:[currentString dataUsingEncoding:NSUTF8StringEncoding]];
    NSString *paragraphsXpathQuery = @"//p//text()";
    NSArray *paragraphNodes = [htmlParser searchWithXPathQuery:paragraphsXpathQuery];
    if ([paragraphNodes count] > 0) {
        NSMutableArray *tempArray = [NSMutableArray array];
        for (TFHppleElement *element in paragraphNodes) {
            [tempArray addObject:[element content]];
        }
        article.paragraphs = tempArray;
    }

Thus, I get an array of paragraphs, and I can use

    NSString *result = [myArray componentsJoinedByString:@"\n\n"];

to join it into a single text with line breaks.
However, if a paragraph contains other tags, each piece of text is treated as a separate entry and ends up on a line of its own, so from a line like this:
    <p>I went to the <a href="blablabla.html">shop</a> to get some milk!</p>
    <p>It was awesome.</p>

I get this:

    I went to the

    shop

    to get some milk!

    It was awesome.

And of course, I would like to get this (ignoring the other tags inside the p tag):

    I went to the shop to get some milk!

    It was awesome.

Can you help me?
    NSString *HTMLTags = @"<[^>]*>"; // regex that matches any HTML tag
    NSString *htmlString = @"<html>bla bla</html>";
    NSString *stringWithoutHTML = [htmlString stringByReplacingOccurrencesOfRegex:HTMLTags withString:@""];

Don't forget to include this in your code:

    #import "RegexKitLite.h"

Here is the link to download this API: http://regexkit.sourceforge.net/#Downloads
In XPath 1.0, you can do this in two steps:
1. Select all p elements: //p
2. For each selected p element (used as the initial context node), evaluate this: string()
Explanation
By definition, the result of applying the standard XPath string() function to an element is the concatenation (in document order) of all its descendant text nodes.
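As far as I know, Hpple's searchWithXPathQuery: only returns node sets, so string() cannot be evaluated through it directly. The sketch below is one way to get the same effect: select the p elements with //p, then walk each element's children and concatenate the text nodes in document order. AppendText and ParagraphsFromHTML are hypothetical helper names, and the code assumes a version of Hpple whose TFHppleElement exposes isTextNode, children and content; in an older version, checking for a tagName of @"text" might be needed instead.

```objective-c
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper: appends the content of every text-node descendant of
// `element` to `buffer`, in document order (mimicking XPath string()).
// Assumes TFHppleElement provides -isTextNode, -children and -content.
static void AppendText(TFHppleElement *element, NSMutableString *buffer) {
    if ([element isTextNode]) {
        [buffer appendString:[element content]];
        return;
    }
    for (TFHppleElement *child in [element children]) {
        AppendText(child, buffer);
    }
}

// Hypothetical helper: returns one string per <p> element, with the text of
// nested tags such as <a> merged into the paragraph.
static NSArray *ParagraphsFromHTML(NSString *html) {
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    TFHpple *parser = [TFHpple hppleWithHTMLData:data];
    NSMutableArray *paragraphs = [NSMutableArray array];
    for (TFHppleElement *p in [parser searchWithXPathQuery:@"//p"]) {
        NSMutableString *text = [NSMutableString string];
        AppendText(p, text);
        [paragraphs addObject:text];
    }
    return paragraphs;
}
```

The returned array can then be joined with componentsJoinedByString:@"\n\n" as in the question, which should keep the linked word inside its paragraph instead of placing it on a line of its own.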