I am trying to pull an audio file from Google's text-to-speech service. Basically, you take a link and append whatever you want spoken to the end of it. The code below works perfectly for English, so I think the problem is in how the Chinese characters get encoded in the request. Here is what I have:
    public static final String AUDIO_CHINESE = "http://www.translate.google.com/translate_tts?tl=zh&q=";
    public static final String AUDIO_ENGLISH = "http://www.translate.google.com/translate_tts?tl=en&q=";

    String text = "text to be spoken";

    URL url = new URL(AUDIO_ENGLISH + text);
    urlConnection = (HttpURLConnection) url.openConnection();
    urlConnection.setRequestMethod("GET");
    urlConnection.setRequestProperty("Accept-Charset", Variables.UTF_8);

    if (urlConnection.getResponseCode() == 200) {
        // get byte array in response
        in = new DataInputStream(urlConnection.getInputStream());
    } else {
        in = new DataInputStream(urlConnection.getErrorStream());
    }

    // use commons io
    byte[] bytes = IOUtils.toByteArray(in);
    in.close();
    urlConnection.disconnect();
    return bytes;
When I try this with Chinese characters, it returns something I cannot play in the media player (I suspect it is not a valid audio file, since the vast majority of the bytes are "85"). So I tried both
    String chText = "你好";
    URL url = new URL(AUDIO_CHINESE + URLEncoder.encode(chText, "UTF-8"));

and

    URL url = new URL(AUDIO_CHINESE + Uri.encode(chText, "UTF-8"));

and then adding

    urlConnection.setRequestProperty("content-type", "application/x-www-form-urlencoded; charset=UTF-8");

to the request headers. This only made things worse: now it does not even return a 200 code, and logcat reports "FileNotFound" instead.
So, on a whim, I went back and tried the URL / Uri encoding with English text, and now English does not return the correct result either. I am not sure what is going on here: the URL shown in the debugger works fine if I copy and paste it into Chrome, but for some reason urlConnection just doesn't work. I feel like I am missing something obvious.
EDIT
Fiddling with it has not yet turned up an answer, just more confusion (and annoyance). For some reason, when the request is sent via HttpURLConnection, the Google TTS engine reads the UTF-8 encoded text as UTF-16, at least as far as I can tell. For example, the character "維" (wei2) is %E7%B6%AD, but if you pass it through the connection, you get back a file that says "see" ("ç", to be precise).
ç, as it turns out, is 0x00E7 in UTF-16 (its UTF-8 percent-encoded form is %C3%A7). I have no idea why this happens in Java, because appending the same percent-encoded string to the end of the link in any browser works correctly. So far I have tried various combinations to get the TTS to read %E7%B6%AD, without much success.
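To make the byte values above easy to check, here is a minimal sketch that reproduces the two percent-encodings quoted above and shows what happens when the UTF-8 bytes of "維" are read one byte per character. The class and method names (`EncodingCheck`, `percentEncode`, `misreadAsLatin1`) are made up for illustration, and the ISO-8859-1 decoding is only an assumed stand-in for whatever the server actually does with the bytes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class EncodingCheck {

    // Percent-encodes text as UTF-8, as URLEncoder does for query values.
    static String percentEncode(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Assumption for illustration: treat each UTF-8 byte as its own character
    // (ISO-8859-1), which is one way a receiver could misread the request.
    static String misreadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String wei = "\u7DAD";       // the character 維
        String cedilla = "\u00E7";   // the character ç

        System.out.println(percentEncode(wei));      // %E7%B6%AD
        System.out.println(percentEncode(cedilla));  // %C3%A7

        // The first misread character has code point 0xE7, i.e. ç.
        String misread = misreadAsLatin1(wei);
        System.out.println(Integer.toHexString(misread.charAt(0))); // e7
    }
}
```

If the first byte of the sequence (0xE7) is taken as a character on its own, the result is exactly ç, which matches what the returned audio says. This only confirms that the encoding on the client side is what I expect; it does not show where the bytes get misread.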
EDIT2
A solution to my problem has been found! See the answer below. The problem was not the encoding; it was the parsing on Google's end. I have edited the title accordingly. Hooray!