Can you use replaceAll() with wildcards?

Question

Good morning. I understand that there are many questions about replace() and replaceAll(), but I have not seen this one.

What I need to do is parse a line (containing actual HTML): once I see the second <p> instance in the line, I want to delete everything that starts with & and ends with ; until I see the next </p>.

To do the second part, I was hoping to use something along the lines of s.replaceAll("&*;","")

This does not work, but hopefully it gets my point across: I am looking to replace everything that starts with & and ends with ;

+4

java string html

Deslyxia Sep 11 '12 at 19:54

2 answers

You want:

 s.replaceAll("&.*?;","");

But do you really want to parse HTML this way? You might be better off using an XML parser.
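As a quick check of the pattern above, here is a minimal sketch of the reluctant quantifier applied to a string with two entities. The class name, the helper `stripEntities`, and the sample input are made up for illustration and do not come from the question.

```java
class ReluctantDemo {
    // Hypothetical helper: removes every run that starts with '&' and
    // ends at the nearest following ';' (reluctant quantifier .*?).
    static String stripEntities(String s) {
        return s.replaceAll("&.*?;", "");
    }

    public static void main(String[] args) {
        // Each entity is removed separately; the text between them is kept.
        System.out.println(stripEntities("Tom &amp; Jerry &lt;3")); // Tom  Jerry 3
    }
}
```

Note that the spaces on either side of the removed `&amp;` both survive, so the result contains a double space.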

+1

Jon lin Sep 11 '12 at 20:03

Brian · Accepted Answer · 2012-09-11T20:13:08+0000

You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee that you will need one to find the text inside the <p> tags.

For the replacement logic, String.replaceAll takes a regular expression, which can express the match you want.
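Because the first argument is a regular expression rather than a shell-style wildcard, the attempt from the question, "&*;", compiles but means something different. The sketch below shows what it does, with String.replace for contrast; the class name, method names, and sample string are assumptions made up for the example.

```java
class WildcardVsRegex {
    // The pattern from the question. As a regex, "&*;" means
    // "zero or more '&' characters followed by ';'".
    static String questionAttempt(String s) {
        return s.replaceAll("&*;", "");
    }

    // String.replace treats both arguments as literal text, not as a regex.
    static String literalReplace(String s) {
        return s.replace("&amp;", "");
    }

    public static void main(String[] args) {
        String s = "a &amp; b";
        // Only the lone ';' matches, so the entity name is left behind.
        System.out.println(questionAttempt(s)); // a &amp b
        System.out.println(literalReplace(s));  // a  b
    }
}
```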

The "wildcard" in the regular expressions you want is the expression .* . Using your example:

 String ampStr = "This &escape;String";
 String removed = ampStr.replaceAll("&.*;", "");
 System.out.println(removed);

This outputs This String. This is because . represents any character, and * means "the preceding element, zero or more times." So .* basically means "any number of characters." However, feeding it:

 "This &escape;String &anotherescape;Extended"

probably won't do what you want: it outputs This Extended, because .* is greedy and matches all the way to the last semicolon. To fix this, specify exactly what you want to match instead of using . for any character. This is done with [^;], which means "any character that is not a semicolon":

 String removed = ampStr.replaceAll("&[^;]*;", "");

This also has a performance advantage over &.*?; on non-matching input, so I highly recommend using this version, especially since not every HTML file will contain an &abc; token, and the &.*?; version can become a serious performance bottleneck as a result.
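To compare the three patterns discussed here side by side, the following sketch runs each one over the two-entity string from this answer. The class name, the constant names, and the `strip` helper are hypothetical. Precompiling with java.util.regex.Pattern is an extra choice for the sketch, not something the answer requires.

```java
import java.util.regex.Pattern;

class EntityPatterns {
    // Compiled once and reused; String.replaceAll compiles its regex on every call.
    static final Pattern GREEDY = Pattern.compile("&.*;");
    static final Pattern RELUCTANT = Pattern.compile("&.*?;");
    static final Pattern NEGATED = Pattern.compile("&[^;]*;");

    // Hypothetical helper: removes every match of p from s.
    static String strip(Pattern p, String s) {
        return p.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        String s = "This &escape;String &anotherescape;Extended";
        System.out.println(strip(GREEDY, s));    // This Extended
        System.out.println(strip(RELUCTANT, s)); // This String Extended
        System.out.println(strip(NEGATED, s));   // This String Extended
    }
}
```

The reluctant and negated-class patterns produce the same result on this input; the difference described above is about how much work the regex engine does, not about which text is removed here.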
