Using Java, I want to strip the fragment identifier and do some simple normalization (e.g. lowercasing schemes and hosts) on a diverse set of URIs. The input and output URIs should be equivalent in the general HTTP sense.
Typically, this should be straightforward. However, for URIs such as http://blah.org/A_%28Secret%29.xml#blah, which percent-encodes (Secret), the behavior of java.net.URI makes life difficult.
The normalization method should return http://blah.org/A_%28Secret%29.xml, since the URIs http://blah.org/A_%28Secret%29.xml and http://blah.org/A_(Secret).xml are not equivalent in interpretation [§2.2; RFC 3986].
So, we have the following two normalization methods:
    // requires: import java.net.URI; (the constructors throw URISyntaxException)
    URI u = new URI("http://blah.org/A_%28Secret%29.xml#blah");
    System.out.println(u); // prints "http://blah.org/A_%28Secret%29.xml#blah"

    String path1 = u.getPath();    // gives "/A_(Secret).xml" (escaped octets decoded)
    String path2 = u.getRawPath(); // gives "/A_%28Secret%29.xml" (raw form)

    // NORMALISE METHOD 1: rebuild from the decoded path
    URI norm1 = new URI(u.getScheme().toLowerCase(), u.getUserInfo(),
            u.getHost().toLowerCase(), u.getPort(), path1, u.getQuery(), null);
    System.out.println(norm1); // prints "http://blah.org/A_(Secret).xml"

    // NORMALISE METHOD 2: rebuild from the raw path
    URI norm2 = new URI(u.getScheme().toLowerCase(), u.getUserInfo(),
            u.getHost().toLowerCase(), u.getPort(), path2, u.getQuery(), null);
    System.out.println(norm2); // prints "http://blah.org/A_%2528Secret%2529.xml"
As we can see, the URI is parsed and rebuilt without the fragment identifier.
However, for method 1, u.getPath() returns the decoded path, which changes the final URI.
For method 2, u.getRawPath() returns the original path, but when it is passed to the URI constructor, Java quotes the '%' characters again, so the path ends up double-encoded.
It seems like a bit of a catch-22.
So, two main questions:
- Why does java.net.URI feel the need to play around with the encoding?
- How can this normalization method be implemented without messing with the original percent-encoding?
(I would prefer not to reimplement the parse/concatenate methods of java.net.URI myself, since they are non-trivial.)
EDIT: Here is some further information from the URI javadoc.
The single-argument constructor requires any illegal characters in its argument to be quoted and preserves any escaped octets and other characters that are present.
The multi-argument constructors quote illegal characters as required by the components in which they appear. The percent character ('%') is always quoted by these constructors. Any other characters are preserved.
The getRawUserInfo, getRawPath, getRawQuery, getRawFragment, getRawAuthority, and getRawSchemeSpecificPart methods return the values of their corresponding components in raw form, without interpreting any escaped octets. The strings returned by these methods may contain both escaped octets and other characters, and will not contain any illegal characters.
The getUserInfo, getPath, getQuery, getFragment, getAuthority, and getSchemeSpecificPart methods decode any escaped octets in their corresponding components. The strings returned by these methods may contain both other characters and illegal characters, and will not contain any escaped octets.
The toString method returns a URI string with all necessary quotation but which may contain other characters.
The toASCIIString method returns a fully quoted and encoded URI string that does not contain any other characters.
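To make the javadoc excerpt concrete, here is a small sketch that contrasts the raw and decoded getters and shows what the four-argument constructor (scheme, host, path, fragment) does with each. The class name `UriQuotingDemo` and the helper `rebuild` are made up for illustration; the expected results in the comments match the behaviour described in the question.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriQuotingDemo {
    // Hypothetical helper: rebuilds a URI via the multi-argument constructor
    // URI(String scheme, String host, String path, String fragment).
    public static URI rebuild(String scheme, String host, String path)
            throws URISyntaxException {
        return new URI(scheme, host, path, null);
    }

    public static void main(String[] args) throws URISyntaxException {
        // Single-argument constructor: escaped octets are preserved as written.
        URI u = new URI("http://blah.org/A_%28Secret%29.xml#blah");

        System.out.println(u.getRawPath()); // /A_%28Secret%29.xml
        System.out.println(u.getPath());    // /A_(Secret).xml

        // Multi-argument constructor: every '%' in the path is quoted as %25.
        System.out.println(rebuild("http", "blah.org", u.getRawPath()));
        // http://blah.org/A_%2528Secret%2529.xml

        // Passing the decoded path loses the original escapes instead.
        System.out.println(rebuild("http", "blah.org", u.getPath()));
        // http://blah.org/A_(Secret).xml
    }
}
```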
Therefore, I cannot use the multi-argument constructors without having the percent-encoding messed with inside the URI class. Ugh!
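One direction that might sidestep the re-encoding, going by the javadoc above, is to concatenate the raw components by hand and pass the result to the single-argument constructor, which preserves escaped octets. The following is only a rough sketch under some assumptions: it handles hierarchical URIs only (opaque ones such as mailto: are not covered), and the names `FragmentStripper` and `stripFragment` are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class FragmentStripper {
    // Hypothetical helper: drops the fragment and lowercases scheme and host,
    // copying every other component in raw form so escapes stay untouched.
    // Assumes a hierarchical URI; opaque URIs are not handled here.
    public static URI stripFragment(URI u) throws URISyntaxException {
        StringBuilder sb = new StringBuilder();
        if (u.getScheme() != null) {
            sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append(':');
        }
        if (u.getHost() != null) {
            // Server-based authority: rebuild it so the host can be lowercased.
            sb.append("//");
            if (u.getRawUserInfo() != null) {
                sb.append(u.getRawUserInfo()).append('@');
            }
            sb.append(u.getHost().toLowerCase(Locale.ROOT));
            if (u.getPort() != -1) {
                sb.append(':').append(u.getPort());
            }
        } else if (u.getRawAuthority() != null) {
            // Registry-based authority: copy it verbatim.
            sb.append("//").append(u.getRawAuthority());
        }
        if (u.getRawPath() != null) {
            sb.append(u.getRawPath());
        }
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        // The single-argument constructor keeps existing escaped octets as-is.
        return new URI(sb.toString());
    }

    public static void main(String[] args) throws URISyntaxException {
        URI u = new URI("HTTP://Blah.ORG/A_%28Secret%29.xml#blah");
        System.out.println(stripFragment(u)); // http://blah.org/A_%28Secret%29.xml
    }
}
```

This still amounts to hand-rolled concatenation, which is what I was hoping to avoid, but it leans on the getters for parsing and leaves the original percent-encoding alone.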