You should probably leave the parsing to a DOM parser (see this question). I can almost guarantee that you will need one to find the text inside the <p> tags.
For the replacement logic, String.replaceAll takes a regular expression, which can express the match you want.
The "wildcard" you want in a regular expression is .* . Using your example:
String ampStr = "This &escape;String";
String removed = ampStr.replaceAll("&.*;", "");
System.out.println(removed);
This outputs This String . That is because . matches any character and * means "zero or more of the preceding element," so .* basically means "any number of characters." However, feeding it:
"This &escape;String &anotherescape;Extended"
probably won't do what you want: it outputs This Extended , because .* is greedy and matches as much as it can, up to the last semicolon in the string. To fix this, specify exactly what you want to match instead of using the catch-all . character. This is done with [^;] , which means "any character that is not a semicolon":
String removed = ampStr.replaceAll("&[^;]*;", "");
This has performance advantages over &.*?; on lines that don't match, so I highly recommend this version, especially since not all HTML files will contain an &abc; token, and the &.*?; version can become a serious performance bottleneck as a result.
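To see the three patterns side by side, here is a minimal sketch. The class name EntityStripDemo and its helper methods are made up for illustration; only String.replaceAll comes from the standard library.

```java
public class EntityStripDemo {

    // Greedy wildcard: .* runs to the LAST semicolon in the string.
    static String stripGreedy(String s) {
        return s.replaceAll("&.*;", "");
    }

    // Negated character class: stops at the first semicolon after each '&'.
    static String stripNegated(String s) {
        return s.replaceAll("&[^;]*;", "");
    }

    // Reluctant wildcard: same result here, but matched by expanding one character at a time.
    static String stripReluctant(String s) {
        return s.replaceAll("&.*?;", "");
    }

    public static void main(String[] args) {
        String input = "This &escape;String &anotherescape;Extended";
        System.out.println(stripGreedy(input));    // This Extended
        System.out.println(stripNegated(input));   // This String Extended
        System.out.println(stripReluctant(input)); // This String Extended
    }
}
```

Note that the second argument to replaceAll is the replacement text, so an empty string removes each match.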
Brian