Get character offsets for elements in jsoup

I need to map jsoup elements back to specific character offsets in the source HTML. In other words, if I have HTML that looks like this:

Hello <br/> World 

I need to know that "Hello" starts at offset 0 and has a length of 6 characters, <br/> starts at offset 6 and has a length of 5 characters, etc.

I could not find a method in the Element javadoc that returns this information. Is there a way to get it?

1 answer

I do not believe that Jsoup has this functionality. This question seems closer to lexical analysis than HTML parsing.

I would write a grammar and then write a lexer against it. The lexer would tokenize the HTML and provide the offsets you are looking for.

First, parse the document with Jsoup to make sure it is valid HTML.

Then lexically analyze the document against the grammar, which might look like this:

 Document := {optional-opening-tag} | {literal} {optional-opening-tag} | {optional-closing-tag}
 optional-opening-tag := ["<" {literal} ">" {optional-opening-tag}|{literal} ] | ""
 optional-closing-tag := "</" {literal} ">" | ""
 literal := any string of characters that does not begin with whitespace and does not contain "<"

Store each token you find in an object that records the token itself, the index of its first character, and its length.
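The idea can be sketched in plain Java without any library. This is only a minimal illustration, not a full HTML lexer: the names OffsetLexer, Token, and tokenize are made up for this example, it treats everything between "<" and the next ">" as one tag, and it does not handle comments, scripts, or ">" inside quoted attribute values. Unlike the literal rule in the grammar above, it keeps whitespace inside text tokens so that the numbers match the ones in the question. Offsets are indices into the Java String (UTF-16 code units).

```java
import java.util.ArrayList;
import java.util.List;

class OffsetLexer {

    /** A token together with its position in the source string. */
    static final class Token {
        final String type;  // "TAG" or "TEXT" (labels chosen for this sketch)
        final String text;  // the raw characters of the token
        final int offset;   // index of the first character
        final int length;   // number of characters

        Token(String type, String text, int offset, int length) {
            this.type = type;
            this.text = text;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return type + " offset=" + offset + " length=" + length + " \"" + text + "\"";
        }
    }

    /** Splits the input into tags ("<" up to ">") and the text runs between them. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            String type;
            int end;
            if (html.charAt(pos) == '<') {
                int close = html.indexOf('>', pos);
                // An unterminated tag runs to the end of the input.
                end = (close < 0) ? html.length() : close + 1;
                type = "TAG";
            } else {
                int open = html.indexOf('<', pos);
                // Text runs until the next tag or the end of the input.
                end = (open < 0) ? html.length() : open;
                type = "TEXT";
            }
            tokens.add(new Token(type, html.substring(pos, end), pos, end - pos));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("Hello <br/> World ")) {
            System.out.println(t);
        }
    }
}
```

For the sample from the question, this yields three tokens: text at offset 0 with length 6, the <br/> tag at offset 6 with length 5, and text at offset 11 with length 7.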


Source: https://habr.com/ru/post/919989/

