XML parser chokes on special characters despite correct encoding

Here is the situation:

I get data from an XML API. The data sometimes contains a special apostrophe character that crashes my parser. The crash only occurs when reading the data from a local file. When I read the data from the stream there is no crash, but I do not get the DOM tree either: it exits without notifying me.

Below is a list of the attempts I made to get this working:

    // Does not work
    var web = new WebClient();
    web.Encoding = Encoding.UTF8;
    var response = web.DownloadString("http://thetvdb.com/api/apikey/series/" + show.TVDBID + "/");
    var tree = XDocument.Parse(response);

    // Works
    var doc = new XmlDocument();
    doc.Load("C:\\Test\\test.xml");
    var response = doc.InnerXml;
    var tree = XDocument.Parse(response);

    // Works
    var xmlDoc = XDocument.Parse(File.ReadAllText("c:\\Test\\test.xml", System.Text.Encoding.UTF8));
    var xmlDoc = XDocument.Load("C:\\Test\\test.xml");
    var tree = xmlDoc;

    // Does not work
    var web = new WebClient();
    web.Encoding = Encoding.UTF8;
    web.DownloadFile("http://thetvdb.com/api/apikey/series/" + show.TVDBID + "/", "C:\\test.xml");
    var tree = XDocument.Load("C:\\test.xml");

    // Does not work
    var web = new WebClient();
    web.Encoding = Encoding.UTF8;
    var data = web.DownloadData("http://thetvdb.com/api/apikey/series/" + show.TVDBID + "/");
    var response = Encoding.UTF8.GetString(data);
    var tree = XDocument.Parse(response);

I judge whether an attempt works by whether the breakpoint on the first line inside this loop is hit:

    if (root != null)
    {
        var lastupdate = root.Element("Series").Element("lastupdated").Value;
        foreach (var epi in tree.Descendants("Episode"))
        {
            var season = epi.Element("SeasonNumber").Value; // Breakpoint here
        }
    }

The crash happens when the parser encounters a particular apostrophe character (shown in a screenshot in the original post).

When I replace this character with a manually typed apostrophe or with ' , the error no longer occurs and parsing continues until the next such character. When I view the page source of the API response in Firefox and Chrome, both report UTF-8 encoding, and the code examples in the API wiki also show UTF-8 in the header.

That's where I am so far. Any ideas?

I just noticed that, according to the XML / text / HTML visualizer in the debugger, the result string from my API request contains only <Series></Series> and no <Episode></Episode> elements. However, when I run the same request in my browser, it shows both. How is that possible? When I look at the response in Postman, it shows the episodes as well.

Update:

When I use Unicode as the encoding, I get no warnings and I can parse the local XML file completely! I'm not an expert on encodings; are there any drawbacks to using Unicode?

When I use Unicode for the data stream, I get a bunch of Asian characters.

+4
4 answers

This is caused by the encoding of your data. The following lets you download the raw bytes (so no encoding problems during the download) and decode them yourself:

    WebClient myWebClient = new WebClient();
    byte[] data = myWebClient.DownloadData(uri);
    string xmlContents = Encoding.UTF8.GetString(data);
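As a variation on the snippet above, the raw bytes can also be handed straight to the parser, so that it picks the encoding from the byte order mark or the XML declaration instead of the caller guessing. This is only a sketch: the class and method names are made up for illustration, and the sample document is built in memory rather than downloaded from the API.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class RawBytesDemo
{
    // Parse XML from raw bytes. No Encoding is passed in: the XML reader
    // detects it from the byte order mark or the XML declaration.
    public static XDocument LoadFromBytes(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        {
            return XDocument.Load(stream);
        }
    }

    // Hypothetical stand-in for what DownloadData might return:
    // a small UTF-16 document preceded by a byte order mark.
    public static byte[] SampleUtf16Bytes()
    {
        const string xml =
            "<?xml version=\"1.0\" encoding=\"utf-16\"?>" +
            "<Data><Series><SeriesName>It\u2019s a test</SeriesName></Series></Data>";
        return Encoding.Unicode.GetPreamble()
            .Concat(Encoding.Unicode.GetBytes(xml))
            .ToArray();
    }

    public static void Main()
    {
        XDocument tree = LoadFromBytes(SampleUtf16Bytes());
        Console.WriteLine(tree.Root.Element("Series").Element("SeriesName").Value);
    }
}
```

The same call should work for UTF-8 input, which is why it can be a safer default than decoding with a hard-coded encoding first.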

EDIT: After your latest update about Unicode, I would say that the data is in fact encoded in UTF-16. Unicode itself is not an encoding; it is essentially a coded character set, i.e. a set of characters and a mapping between those characters and the integer code points that represent them. When you "encode something in Unicode," it usually means UTF-16; in .NET, Encoding.Unicode is UTF-16 little-endian. In any case, I am glad that your problem has been resolved!
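The "bunch of Asian characters" from the question's update is what one would expect when UTF-8 bytes are decoded as UTF-16: every pair of ASCII bytes is read as a single 16-bit code unit, and many of those values fall in the CJK ranges. A minimal sketch of that effect follows; the class and method names are illustrative only.

```csharp
using System;
using System.Text;

public static class WrongEncodingDemo
{
    // Encode text as UTF-8, then decode the bytes as UTF-16 LE
    // (which is what Encoding.Unicode means in .NET).
    public static string DecodeUtf8AsUtf16(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.Unicode.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // Eight ASCII bytes become four UTF-16 code units.
        string garbled = DecodeUtf8AsUtf16("<Series>");
        foreach (char c in garbled)
        {
            Console.WriteLine("U+{0:X4}", (int)c);
        }
    }
}
```

For the input "<Series>" the four code units are U+533C, U+7265, U+6569 and U+3E73, all of which are CJK ideographs.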

+1

Try

    var tree = XElement.Parse(response);
    foreach (var epi in tree.Descendants("Episode"))
    {
        ...
    }

If Data is your root node and the Episode elements are not nested any deeper, you can replace Descendants with Elements.
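To illustrate the difference, here is a small sketch using a made-up document shaped like the response described in the question, with one extra Episode nested inside a hypothetical Wrapper element. Descendants finds Episode elements at any depth, while Elements only looks at direct children of the root.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class EpisodeQueryDemo
{
    // Hypothetical sample data; the Wrapper element is invented to show nesting.
    public const string SampleXml =
        "<Data>" +
        "<Series><lastupdated>1</lastupdated></Series>" +
        "<Episode><SeasonNumber>1</SeasonNumber></Episode>" +
        "<Episode><SeasonNumber>2</SeasonNumber></Episode>" +
        "<Wrapper><Episode><SeasonNumber>3</SeasonNumber></Episode></Wrapper>" +
        "</Data>";

    // Counts Episode elements at any depth below the root.
    public static int CountWithDescendants(XElement root)
    {
        return root.Descendants("Episode").Count();
    }

    // Counts only Episode elements that are direct children of the root.
    public static int CountWithElements(XElement root)
    {
        return root.Elements("Episode").Count();
    }

    public static void Main()
    {
        XElement root = XElement.Parse(SampleXml);
        Console.WriteLine(CountWithDescendants(root)); // 3
        Console.WriteLine(CountWithElements(root));    // 2
    }
}
```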

0

' is an HTML escape for specific browsers. Use &apos; instead, which is the correct XML escape sequence.
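For reference, &apos; is one of the five entities that XML predefines, so an XML parser resolves it without any DTD, whereas HTML-only named entities are rejected. The sketch below shows both cases; &rsquo; is my own choice of an HTML-only entity for illustration, and the helper name is hypothetical.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class EntityDemo
{
    // Returns true and the element's text if the fragment is well-formed XML,
    // false if the parser rejects it.
    public static bool TryGetValue(string xml, out string value)
    {
        try
        {
            value = XElement.Parse(xml).Value;
            return true;
        }
        catch (XmlException)
        {
            value = null;
            return false;
        }
    }

    public static void Main()
    {
        string value;

        // &apos; is predefined in XML and resolves to a plain apostrophe.
        Console.WriteLine(TryGetValue("<EpisodeName>It&apos;s here</EpisodeName>", out value));
        Console.WriteLine(value);

        // &rsquo; exists in HTML but is not declared in plain XML, so parsing fails.
        Console.WriteLine(TryGetValue("<EpisodeName>It&rsquo;s here</EpisodeName>", out value));
    }
}
```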

It looks like you got a “smart quote” from one of those annoying Microsoft products that change all your quotes and apostrophes to curly ones, in an encoding that claims to be ISO-8859-1 / Latin-1 but is really Windows-1252, which puts printable characters where Latin-1 has the C1 control range. If so, only the Windows-1252 encoding will parse this document for you. Alternatively, you can replace the curly quotes with ordinary ones, and everything will be fine.
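One way to check whether a file really contains such a byte is to decode it with a strict UTF-8 decoder, which throws on invalid sequences instead of silently substituting a replacement character. In Windows-1252 the right curly apostrophe is the single byte 0x92, which is not valid UTF-8 on its own. This is a sketch with a hypothetical helper name and hand-made sample bytes, not data from the API.

```csharp
using System;
using System.Text;

public static class StrictUtf8Demo
{
    // Returns true if the bytes are well-formed UTF-8.
    public static bool IsValidUtf8(byte[] data)
    {
        // Second argument: throw on invalid bytes instead of replacing them.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // "It’s" with the curly apostrophe encoded as UTF-8 (bytes E2 80 99).
        byte[] curlyAsUtf8 = Encoding.UTF8.GetBytes("It\u2019s");

        // "It’s" with the curly apostrophe as the Windows-1252 byte 0x92.
        byte[] curlyAsWin1252 = { 0x49, 0x74, 0x92, 0x73 };

        Console.WriteLine(IsValidUtf8(curlyAsUtf8));
        Console.WriteLine(IsValidUtf8(curlyAsWin1252));
    }
}
```

If the check fails on a file that claims to be UTF-8, that would support the Windows-1252 explanation above.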

0

I found the solution, and it is a bit of an anticlimax. The episodes were not returned because my API URL was incomplete: it should have ended with /all/ , but I must have dropped that at some point and kept copying the truncated version from then on. It was the last place I looked.

After changing the API call, I now get all the episodes. There are no more encoding errors (although I haven't changed anything in that respect), and it has already processed 4,000 episodes, so I assume the rest will go through without problems.

Someone made this a community wiki; I'm not sure that status is still justified, since it turned out to be a localized issue. I did learn a lot about XML and APIs from this discussion, though, so thanks to everyone involved!
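Since the missing /all/ only showed up as a loop that never ran, a small guard that fails loudly when a response contains no Episode elements might have surfaced the problem sooner. The sketch below assumes the element names used in the question; the helper name and the sample strings are invented for illustration.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class EpisodeGuardDemo
{
    // Returns the number of Episode elements, or throws if there are none,
    // so that an incomplete response cannot pass silently.
    public static int RequireEpisodes(XDocument tree)
    {
        int count = tree.Descendants("Episode").Count();
        if (count == 0)
        {
            throw new InvalidOperationException(
                "Response contains no Episode elements; check the request URL.");
        }
        return count;
    }

    public static void Main()
    {
        // Hypothetical responses: one with series data only, one with an episode.
        var seriesOnly = XDocument.Parse("<Data><Series /></Data>");
        var withEpisode = XDocument.Parse(
            "<Data><Series /><Episode><SeasonNumber>1</SeasonNumber></Episode></Data>");

        Console.WriteLine(RequireEpisodes(withEpisode));
        try
        {
            RequireEpisodes(seriesOnly);
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```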

0

Source: https://habr.com/ru/post/1487669/

