HttpClient and non-ASCII URLs (á, é, í, ó, ú)

"Long time reader, first poster" here.

I am creating a bot for a Spanish wiki that I administer. I wanted to do this from scratch, as one of my goals is to practice Java. However, I ran into problems when making GET requests with HttpClient to URIs that contain non-ASCII characters, such as á, é, í, ó or ú.

    String url = "http://es.metroid.wikia.com/api.php?action=query&list=categorymembers&cmtitle=Categoría:Mejoras de las Botas";
    method = new GetMethod(url);
    client.executeMethod(method);

When I do this, GetMethod complains about the URI:

    Exception in thread "main" java.lang.IllegalArgumentException: Invalid uri 'http://es.pruebaloca.wikia.com/api.php?action=query&list=categorymembers&cmtitle=Categoría:Mejoras%20de%20las%20Botas&cmlimit=500&format=xml': Invalid query
        at org.apache.commons.httpclient.HttpMethodBase.<init>(HttpMethodBase.java:222)
        at org.apache.commons.httpclient.methods.GetMethod.<init>(GetMethod.java:89)
        at net.metroidover.categorybot.http.HttpRequest.request(HttpRequest.java:69)
        at net.metroidover.categorybot.http.HttpRequest.request(HttpRequest.java:120)
        at net.metroidover.categorybot.http.Action.getCategoryMembers(Action.java:38)
        at net.metroidover.categorybot.bot.BotComponent.<init>(BotComponent.java:58)
        at net.metroidover.categorybot.bot.BotComponent.main(BotComponent.java:80)

Note that in the URI shown in the stack trace, the spaces are encoded as %20 but the í is left as it is. The same URI works fine in a browser, but I can't get GetMethod to accept it.

I also tried the following:

    URI uri = new URI(url, false);
    method = new GetMethod(uri.getEscapedURI());
    client.executeMethod(method);

This way the í is escaped, but the spaces are escaped twice (%2520):

 http://es.metroid.wikia.com/api.php?action=query&list=categorymembers&cmtitle=Categor%C3%ADa:Mejoras%2520de%2520las%2520Botas&cmlimit=500&format=xml 
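The %2520 is what an already-escaped space looks like after a second escaping pass: the % of %20 is itself encoded as %25. The following is only a minimal sketch of that effect using the JDK's java.net.URLEncoder rather than the Commons HttpClient URI class; the class name DoubleEscapeDemo and the helper escape are made up for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class DoubleEscapeDemo {
    // Hypothetical helper: percent-encodes a value as UTF-8. URLEncoder writes
    // spaces as '+', so they are rewritten to %20 to match the URIs above.
    static String escape(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String once = escape("Mejoras de las Botas");
        String twice = escape(once); // escaping text that is already escaped
        System.out.println(once);    // Mejoras%20de%20las%20Botas
        System.out.println(twice);   // Mejoras%2520de%2520las%2520Botas
    }
}
```

So each part of the URI should be escaped exactly once, either by the caller or by the library, but not by both.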

Now, if I do not use spaces in the query, there is no double escaping and I get the desired result. So, if non-ASCII characters were not a possibility, I would not need the URI class at all and would not get the double escaping. To avoid the first escaping of the spaces, I tried this:

    URI uri = new URI(url, true);
    method = new GetMethod(uri.getEscapedURI());
    client.executeMethod(method);

But the URI class did not like that:

    org.apache.commons.httpclient.URIException: Invalid query
        at org.apache.commons.httpclient.URI.parseUriReference(URI.java:2049)
        at org.apache.commons.httpclient.URI.<init>(URI.java:167)
        at net.metroidover.categorybot.http.HttpRequest.request(HttpRequest.java:66)
        at net.metroidover.categorybot.http.HttpRequest.request(HttpRequest.java:121)
        at net.metroidover.categorybot.http.Action.getCategoryMembers(Action.java:38)
        at net.metroidover.categorybot.bot.BotComponent.<init>(BotComponent.java:58)
        at net.metroidover.categorybot.bot.BotComponent.main(BotComponent.java:80)
    Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 1, Size: 0
        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
        at java.util.ArrayList.get(ArrayList.java:322)
        at net.metroidover.categorybot.http.Action.getCategoryMembers(Action.java:39)
        at net.metroidover.categorybot.bot.BotComponent.<init>(BotComponent.java:58)
        at net.metroidover.categorybot.bot.BotComponent.main(BotComponent.java:80)

Any input on how to avoid this double escaping is welcome. I have searched everywhere without any luck.

Thanks!

Edit: The solution that worked best for me was parsifical's, but as an addition, I would like to say that setting the path with method.setPath(url) made HttpMethod reject the cookie I needed to save:

    Aug 26, 2011 4:07:08 PM org.apache.commons.httpclient.HttpMethodBase processCookieHeaders
    WARNING: Cookie rejected: "wikicities_session=900beded4191ff880e09944c7c0aaf5a". Illegal path attribute "/". Path of origin: "http://es.metroid.wikia.com/api.php"

However, if I pass the URL to the constructor and skip setPath(url), the cookie is saved without problems.

    String url = "http://es.metroid.wikia.com/api.php";
    NameValuePair[] query = {
        new NameValuePair("action", "query"),
        new NameValuePair("list", "categorymembers"),
        new NameValuePair("cmtitle", "Categoría:Mejoras de las Botas"),
        new NameValuePair("cmlimit", "500"),
        new NameValuePair("format", "xml")
    };
    HttpMethod method = null;

    ...

    method = new GetMethod(url); // Or PostMethod(url)
    method.getParams().setCookiePolicy(CookiePolicy.BROWSER_COMPATIBILITY); // It had been like this the whole time
    method.setQueryString(query);
    client.executeMethod(method);
3 answers

Looking at the HttpMethodBase documentation, it seems that all String parameters are expected to be already encoded. The simplest solution is to build your URL in stages, with setPath() and the variant of setQueryString() that takes an array of NameValuePair objects.


I would recommend using URLEncoder to encode your query string values (not the whole query string).

    URLEncoder.encode("Categoría:Mejoras de las Botas", "UTF-8");
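As a rough sketch of how that could look for a whole set of parameters, the example below encodes each name and value separately and only then joins them with = and &, so the separators are never escaped. It uses only the JDK; the class QueryStringBuilder and its methods are hypothetical names chosen for illustration, and the parameters are the ones from the question. Note that URLEncoder produces form-style encoding, so spaces come out as + and the colon as %3A, which servers normally decode back in a query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringBuilder {
    // Encodes a single name or value as UTF-8 form data (spaces become '+').
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Joins the pairs with '=' and '&', encoding each side on its own so the
    // separators themselves stay unescaped.
    static String build(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&');
            }
            sb.append(encode(e.getKey())).append('=').append(encode(e.getValue()));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("action", "query");
        params.put("list", "categorymembers");
        params.put("cmtitle", "Categoría:Mejoras de las Botas");
        params.put("cmlimit", "500");
        params.put("format", "xml");
        System.out.println("http://es.metroid.wikia.com/api.php?" + build(params));
    }
}
```

The resulting string contains only ASCII characters, so it can be appended to the base URL without any further escaping pass.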

Why don't you try adding the parameters as NameValuePair objects? The problem is that when you escape the whole URL, everything in it gets escaped, including things like http://, and that is why the system complains.

You can also escape only the arguments using URLEncoder.encode(): pass it just the GET parameter values and append the return value to the URL.

    String url = "http://es.metroid.wikia.com/api.php?action=query&list=categorymembers&cmtitle="
            + URLEncoder.encode("Categoría:Mejoras de las Botas", "UTF-8");


Source: https://habr.com/ru/post/1369003/

