Convert NSString with Unicode characters to valid HTML

I get a string from an API that contains anchor tags, so I create an NSAttributedString from it and display it in a UITextView so that the links are tappable.

The problem is that the input string is not valid HTML: it contains raw, unescaped Unicode characters, such as:

  • HORIZONTAL ELLIPSIS, Unicode: U+2026, UTF-8: E2 80 A6
  • EM DASH, Unicode: U+2014, UTF-8: E2 80 94

I could handle these specific cases one by one, but I am worried about any other Unicode characters that might come in that I don't know about yet.

Example:

    // Note: \u takes four hex digits; \U requires eight, so \U2014 would not compile.
    NSString *fromAPI = @"Reagan \u2014 saying";
    NSDictionary *options = @{NSDocumentTypeDocumentAttribute : NSHTMLTextDocumentType};
    NSData *data = [fromAPI dataUsingEncoding:NSUTF8StringEncoding allowLossyConversion:NO];
    NSAttributedString *attributedString = [[NSAttributedString alloc] initWithData:data
                                                                             options:options
                                                                  documentAttributes:nil
                                                                               error:nil];

In a UITextView, this displays with the em dash garbled. [image: screenshot of the rendered text]

How can I get it to correctly display the em dash and any other Unicode characters?

2 answers

Found it. It looks like the HTML importer will not decode the Unicode characters correctly unless you add this to the <head>:

 <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> 
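
As a minimal sketch of how that fix could be applied to the code in the question: the data is UTF-8, but nothing tells the HTML importer so, which is presumably why the multi-byte characters come out wrong. The helper name `AttributedStringFromHTMLFragment` is hypothetical. The sketch prepends the meta tag from this answer and also passes `NSCharacterEncodingDocumentAttribute`, which is not mentioned in the answer but is a documented reading option; either one alone may be enough.

```objc
#import <UIKit/UIKit.h>

// Hypothetical helper (not from the original post): builds an attributed
// string from an HTML fragment while declaring the bytes to be UTF-8.
// The HTML importer should be called on the main thread.
static NSAttributedString *AttributedStringFromHTMLFragment(NSString *fragment) {
    // Prepend the charset declaration suggested in the answer.
    NSString *meta = @"<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />";
    NSString *html = [meta stringByAppendingString:fragment];
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];

    // Assumption: also stating the encoding in the options is a second way
    // to tell the importer how to decode the data.
    NSDictionary *options = @{
        NSDocumentTypeDocumentAttribute : NSHTMLTextDocumentType,
        NSCharacterEncodingDocumentAttribute : @(NSUTF8StringEncoding)
    };

    NSError *error = nil;
    NSAttributedString *result = [[NSAttributedString alloc] initWithData:data
                                                                  options:options
                                                       documentAttributes:nil
                                                                    error:&error];
    if (result == nil) {
        NSLog(@"HTML import failed: %@", error);
    }
    return result;
}
```

It would be called as `AttributedStringFromHTMLFragment(@"Reagan \u2014 saying")` and the result assigned to the text view's `attributedText`.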

What I was going to suggest (if I understood the question correctly) was to use a regular expression or something similar to append the variation selector U+FE0E (written \U0000FE0E or just \uFE0E in a string literal) after every unescaped Unicode character, for example:

    NSString *fromAPI = @"Reagan \u2014 saying";
    NSString *convertedFromAPI = @"Reagan \u2014\uFE0E saying";

But I think what you are doing at the moment makes more sense.
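
If the goal is instead to make the string encoding-independent before it reaches the HTML importer, one option is to rewrite every non-ASCII character as a numeric character reference, which covers characters that are not known in advance. The following is only a sketch; the function name `EscapeNonASCIIForHTML` is hypothetical. It leaves ASCII untouched, so the anchor tags in the input survive.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the original post): replaces each non-ASCII
// code point with a hexadecimal numeric character reference, e.g. U+2014
// becomes "&#x2014;". ASCII, including '<', '>' and '&', is copied as-is.
static NSString *EscapeNonASCIIForHTML(NSString *input) {
    // UTF-32 gives one 32-bit unit per code point, so surrogate pairs
    // do not need special handling. This encoding constant adds no BOM.
    NSData *utf32 = [input dataUsingEncoding:NSUTF32LittleEndianStringEncoding];
    if (utf32 == nil) {
        return input; // Conversion failed (e.g. malformed input); leave unchanged.
    }

    const uint32_t *codePoints = (const uint32_t *)utf32.bytes;
    NSUInteger count = utf32.length / sizeof(uint32_t);
    NSMutableString *output = [NSMutableString stringWithCapacity:input.length];

    for (NSUInteger i = 0; i < count; i++) {
        uint32_t cp = CFSwapInt32LittleToHost(codePoints[i]);
        if (cp < 0x80) {
            [output appendFormat:@"%C", (unichar)cp];
        } else {
            [output appendFormat:@"&#x%X;", (unsigned int)cp];
        }
    }
    return output;
}
```

The escaped string can then be converted to data and passed to the HTML importer in the same way as in the question.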


Source: https://habr.com/ru/post/970952/

