Normalization of possibly encoded URI strings in Java

Using Java, I want to strip the fragment identifier and do some simple normalization (e.g. lowercasing schemes and hosts) on a diverse set of URIs. The input and output URIs should be equivalent in a general HTTP sense.

This should usually be simple. However, for URIs such as http://blah.org/A_%28Secret%29.xml#blah , which percent-encodes (Secret) , the behavior of java.net.URI makes life difficult.

The normalization method should return http://blah.org/A_%28Secret%29.xml , since the URIs http://blah.org/A_%28Secret%29.xml and http://blah.org/A_(Secret).xml are not equivalent in interpretation [§2.2; RFC 3986].

So, we have the following two normalization methods:

    URI u = new URI("http://blah.org/A_%28Secret%29.xml#blah");
    System.out.println(u);
    // prints "http://blah.org/A_%28Secret%29.xml#blah"

    String path1 = u.getPath();    // gives "/A_(Secret).xml"
    String path2 = u.getRawPath(); // gives "/A_%28Secret%29.xml"

    // NORMALISE METHOD 1
    URI norm1 = new URI(u.getScheme().toLowerCase(), u.getUserInfo(),
            u.getHost().toLowerCase(), u.getPort(), path1, u.getQuery(), null);
    System.out.println(norm1);
    // prints "http://blah.org/A_(Secret).xml"

    // NORMALISE METHOD 2
    URI norm2 = new URI(u.getScheme().toLowerCase(), u.getUserInfo(),
            u.getHost().toLowerCase(), u.getPort(), path2, u.getQuery(), null);
    System.out.println(norm2);
    // prints "http://blah.org/A_%2528Secret%2529.xml"

As we can see, the URI is parsed and rebuilt without a fragment identifier.

However, for method 1, u.getPath() returns the decoded path, which changes the final URI.

For method 2, u.getRawPath() returns the original path, but when it is passed to the URI constructor, Java double-encodes it.

It feels like a bit of a Chinese finger trap.

So, two main questions:

  • Why does java.net.URI feel the need to play with the encoding?
  • How can the above normalization method be implemented without tampering with the original percent-encoding?

(I would prefer not to reimplement the parsing and concatenation logic of java.net.URI, which is nontrivial.)


EDIT: Here is some further information from the URI javadoc.

  • The single-argument constructor requires any illegal characters in its argument to be quoted and preserves any escaped octets and other characters that are present.

  • The multi-argument constructors quote illegal characters as required by the components in which they appear. The percent character ('%') is always quoted by these constructors. Any other characters are preserved.

  • The getRawUserInfo, getRawPath, getRawQuery, getRawFragment, getRawAuthority, and getRawSchemeSpecificPart methods return the values of their corresponding components in raw form, without interpreting any escaped octets. The strings returned by these methods may contain both escaped octets and other characters, and will not contain any illegal characters.

  • The getUserInfo, getPath, getQuery, getFragment, getAuthority, and getSchemeSpecificPart methods decode any escaped octets in their corresponding components. The strings returned by these methods may contain both other characters and illegal characters, and will not contain any escaped octets.

  • The toString method returns a URI string with all necessary quotation but which may contain other characters.

  • The toASCIIString method returns a fully quoted and encoded URI string that does not contain any other characters.
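
The javadoc notes above can be illustrated with a small sketch. The class name UriQuotingDemo and the helper viaMultiArg are made up for this illustration; only java.net.URI calls from the JDK are used.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriQuotingDemo {

    // Hypothetical helper: rebuilds a URI through a multi-argument
    // constructor, passing the RAW path. The constructor quotes '%' again.
    static URI viaMultiArg(URI u) {
        try {
            return new URI(u.getScheme(), u.getHost(), u.getRawPath(), null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Single-argument parsing keeps escaped octets exactly as written.
        URI single = URI.create("http://blah.org/A_%28Secret%29.xml");
        System.out.println(single.getRawPath()); // prints "/A_%28Secret%29.xml"
        System.out.println(single.getPath());    // prints "/A_(Secret).xml"

        // Multi-argument construction always quotes '%', hence "%2528".
        System.out.println(viaMultiArg(single));
        // prints "http://blah.org/A_%2528Secret%2529.xml"

        // toString may keep non-ASCII ("other") characters;
        // toASCIIString encodes them as UTF-8 escaped octets.
        URI other = URI.create("http://blah.org/caf\u00e9");
        System.out.println(other.toASCIIString()); // prints "http://blah.org/caf%C3%A9"
    }
}
```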

So it seems I cannot use the multi-argument constructors without the URI class tampering with the percent-encoding. Ugh!

2 answers

Because java.net.URI was introduced in Java 1.4 (released in 2002) and is based on RFC 2396, which treats '(' and ')' as characters that do not need to be escaped, and whose meaning does not change even if they are escaped. Moreover, it even says that such characters should not be escaped unless necessary (§2.3, RFC 2396).

RFC 3986 (released in 2005) changed this, and I think the JDK developers decided not to change the behavior of java.net.URI in order to stay compatible with existing code.

From some quick googling, Jena IRI looks promising.

    import java.util.ArrayList;
    // IRI and IRIFactory come from the Jena IRI library
    // (package org.apache.jena.iri in current Apache Jena releases).
    import org.apache.jena.iri.IRI;
    import org.apache.jena.iri.IRIFactory;

    public class IRITest {
        public static void main(String[] args) {
            IRIFactory factory = IRIFactory.uriImplementation();

            IRI iri = factory.construct("http://blah.org/A_%28Secret%29.xml#blah");
            ArrayList<String> a = new ArrayList<String>();
            a.add(iri.getScheme());
            a.add(iri.getRawUserinfo());
            a.add(iri.getRawHost());
            a.add(iri.getRawPath());
            a.add(iri.getRawQuery());
            a.add(iri.getRawFragment());

            IRI iri2 = factory.construct("http://blah.org/A_(Secret).xml#blah");
            ArrayList<String> b = new ArrayList<String>();
            b.add(iri2.getScheme());
            b.add(iri2.getRawUserinfo());
            b.add(iri2.getRawHost());
            b.add(iri2.getRawPath());
            b.add(iri2.getRawQuery());
            b.add(iri2.getRawFragment());

            System.out.println(a);
            // [http, null, blah.org, /A_%28Secret%29.xml, null, blah]
            System.out.println(b);
            // [http, null, blah.org, /A_(Secret).xml, null, blah]
        }
    }
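
If an extra dependency is not an option, one possible JDK-only workaround (a sketch, not something the javadoc prescribes) is to reassemble the string from the raw components and hand it back to the single-argument parser, which preserves escaped octets. The class name UriNormalizer and the method normalize are hypothetical names chosen for this example.

```java
import java.net.URI;
import java.util.Locale;

public class UriNormalizer {

    // Hypothetical helper: drops the fragment and lowercases scheme and host,
    // leaving the percent-encoding of every other component untouched.
    static URI normalize(URI u) {
        StringBuilder sb = new StringBuilder();
        if (u.getScheme() != null) {
            sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append(':');
        }
        if (u.isOpaque()) {
            // e.g. mailto: URIs have no hierarchical structure to rebuild
            sb.append(u.getRawSchemeSpecificPart());
        } else {
            if (u.getHost() != null) {
                sb.append("//");
                if (u.getRawUserInfo() != null) {
                    sb.append(u.getRawUserInfo()).append('@');
                }
                sb.append(u.getHost().toLowerCase(Locale.ROOT));
                if (u.getPort() != -1) {
                    sb.append(':').append(u.getPort());
                }
            } else if (u.getRawAuthority() != null) {
                // registry-based authority: keep it as written
                sb.append("//").append(u.getRawAuthority());
            }
            if (u.getRawPath() != null) {
                sb.append(u.getRawPath());
            }
            if (u.getRawQuery() != null) {
                sb.append('?').append(u.getRawQuery());
            }
        }
        // URI.create uses the single-argument parser, so "%28" stays "%28".
        return URI.create(sb.toString());
    }

    public static void main(String[] args) {
        URI u = URI.create("HTTP://Blah.ORG/A_%28Secret%29.xml#blah");
        System.out.println(normalize(u));
        // prints "http://blah.org/A_%28Secret%29.xml"
    }
}
```

This avoids the multi-argument constructors entirely, at the cost of repeating a little of the concatenation logic by hand.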

Note this passage at the end of [§2.2; RFC 3986]:

URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component. If a reserved character is found in a URI component and no delimiting role is known for that character, then it must be interpreted as representing the data octet corresponding to that character's encoding in US-ASCII.

So, as long as this scheme is http or https, encoding is the correct behavior.

Try using toASCIIString instead of toString to print the URI. For instance:

    System.out.println(norm1.toASCIIString());

Source: https://habr.com/ru/post/909213/

