How to truncate an HTML string without breaking the markup?

I need to display the first N (e.g. 50 or 100) characters of an HTML string, and the result has to be well-formed HTML. If I use a simple substring, I can cut through a tag and end up with broken markup. For example:

Example string: "<html><body><a href="http://foo.com">foo</a></body></html>"

Truncated string: "<html><body><a href="http://foo.com">foo<"

This leaves me with broken HTML :(

Any ideas on how to achieve this?

+4
3 answers

You can try using the HTML Agility Pack - it will parse the HTML for you, but you will need to figure out how to produce the truncated version yourself. This should make things a lot easier, though.
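
The HTML Agility Pack is a .NET library, so the following is not its API. As a rough illustration of the same idea (let a parser read the markup, then produce the truncated version yourself), here is a minimal sketch using Python's standard `html.parser` instead. It assumes N counts visible text characters rather than markup characters, and it drops comments and doctypes. The names `truncate_html` and `TruncatingParser` are made up for this example.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TruncatingParser(HTMLParser):
    """Copies markup to `out` until `limit` text characters have been emitted."""

    def __init__(self, limit):
        super().__init__(convert_charrefs=True)
        self.remaining = limit   # text characters we may still emit
        self.out = []            # pieces of the truncated document
        self.open_tags = []      # tags opened but not yet closed

    def handle_starttag(self, tag, attrs):
        if self.remaining <= 0:
            return
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.remaining <= 0:
            return
        # Only close the tag that is currently innermost.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.remaining <= 0:
            return
        piece = data[:self.remaining]
        self.remaining -= len(piece)
        self.out.append(escape(piece, quote=False))


def truncate_html(markup, limit):
    """Keep the first `limit` text characters and close any tags left open."""
    parser = TruncatingParser(limit)
    parser.feed(markup)
    parser.close()
    closers = "".join(f"</{tag}>" for tag in reversed(parser.open_tags))
    return "".join(parser.out) + closers
```

With the example string from the question, `truncate_html(example, 2)` keeps `fo` inside the link and still closes `</a></body></html>`. If N has to count markup characters as well, the bookkeeping in the three handlers would need to change accordingly.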

+3

Parse the HTML into a DOM tree. Start with the deepest / innermost elements and

  • delete the contents of the innermost node, or the node itself if it has no content
  • check the length of the string.

Lather, rinse, repeat until the string is short enough.

This may truncate your string to an empty string if the desired length is small enough.

For bonus points, you can also try removing the nodes' attributes along the way.
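
The pruning loop described in this answer could look roughly like the sketch below. It uses Python's standard `xml.etree.ElementTree`, so it assumes the input is well-formed enough to parse as XML (the example string from the question is); real-world HTML would need a proper HTML parser. The order of the steps (text first, then attributes, then the node itself) is one possible choice, and `deepest` and `prune_html` are hypothetical names.

```python
import xml.etree.ElementTree as ET


def deepest(elem, parent=None, depth=0):
    """Return (depth, parent, element) for the most deeply nested element."""
    best = (depth, parent, elem)
    for child in elem:
        candidate = deepest(child, elem, depth + 1)
        # ">=" prefers later siblings, so pruning works from the end.
        if candidate[0] >= best[0]:
            best = candidate
    return best


def prune_html(markup, limit):
    """Prune innermost nodes until the serialized markup fits in `limit`."""
    root = ET.fromstring(markup)
    while True:
        text = ET.tostring(root, encoding="unicode", method="html")
        if len(text) <= limit:
            return text
        _, parent, node = deepest(root)
        if node.text:
            node.text = None        # first drop the node's content
        elif node.attrib:
            node.attrib.clear()     # then its attributes
        elif parent is None:
            return ""               # nothing left to remove
        else:
            parent.remove(node)     # finally the node (and its tail text)
```

Here the limit counts every character of the serialized markup, tags included, so the example string (58 characters) comes back unchanged for a limit of 58 and shrinks to `<html><body></body></html>` for a limit of 30.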

+1

I have seen some forum systems simply append </b> </u> </i> </s> after every individual post. You could take a similar approach.

Of course, it's ugly, and it would not fix that trailing <

Still, this is by far the easiest method. The best approach is to actually build a tree and ... kick out nodes until you meet the requirement.
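
For what it's worth, the crude version fits in a few lines. The sketch below cuts the string, appends the same fixed list of closers, and adds one extra step that is not part of the forum trick: a regular expression that drops a dangling partial tag at the end of the cut. The name `crude_truncate` is made up, and the regex assumes attribute values never contain `>`.

```python
import re

# Fixed closers appended blindly, as in the forum trick.
CLOSERS = "</b></u></i></s>"


def crude_truncate(markup, limit):
    """Cut at `limit` characters, drop a half-written tag, append closers."""
    cut = markup[:limit]
    # Remove a trailing "<" or "<a hre" that the cut left behind.
    cut = re.sub(r"<[^>]*$", "", cut)
    return cut + CLOSERS
```

The output is not well-formed (closers can be unmatched or mis-nested, and tags outside the fixed list stay open), so this only works where the browser's error recovery is good enough.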

0

Source: https://habr.com/ru/post/1302210/

