How to use Google Text-to-Speech for Chinese characters on Android?

I am trying to pull an audio file from Google's text-to-speech service. Basically, you take a base link and append the text you want spoken to the end of it. The code below works perfectly for English, so I think the problem is in how the Chinese characters are encoded in the request. Here is what I have:

    String text = "text to be spoken";
    public static final String AUDIO_CHINESE = "http://www.translate.google.com/translate_tts?tl=zh&q=";
    public static final String AUDIO_ENGLISH = "http://www.translate.google.com/translate_tts?tl=en&q=";

    URL url = new URL(AUDIO_ENGLISH + text);
    urlConnection = (HttpURLConnection) url.openConnection();
    urlConnection.setRequestMethod("GET");
    urlConnection.setRequestProperty("Accept-Charset", Variables.UTF_8);
    if (urlConnection.getResponseCode() == 200) {
        // get byte array in response
        in = new DataInputStream(urlConnection.getInputStream());
    } else {
        in = new DataInputStream(urlConnection.getErrorStream());
    }
    // use commons io
    byte[] bytes = IOUtils.toByteArray(in);
    in.close();
    urlConnection.disconnect();
    return bytes;

When I try this with Chinese characters, the response is something the media player cannot play (I suspect it is not a valid audio file, since the vast majority of its bytes are "85"). So I tried both

    String chText = "你好";
    URL url = new URL(AUDIO_CHINESE + URLEncoder.encode(chText, "UTF-8"));

and

 URL url = new URL(AUDIO_CHINESE + Uri.encode(chText, "UTF-8")); 

and then adding

 urlConnection.setRequestProperty("content-type", "application/x-www-form-urlencoded; charset=UTF-8"); 

to the request headers. This only made things worse: now the request does not even return a 200 code, and logcat reports "FileNotFound" instead.

So, on a whim, I went back and tried the URL / Uri encoding with English text, and now English does not return the correct result either. I am not sure what is going on here: the URL shown in the debugger works fine if I copy and paste it into Chrome, but for some reason urlConnection just doesn't work. I feel like I am missing something obvious.

EDIT

Fiddling with it has not produced an answer yet, just more confusion (and annoyance). For some reason, when the request is sent via HttpURLConnection, the Google TTS service seems to read the UTF-8-encoded text as UTF-16, at least as far as I can tell. For example, the character "維" (wei2) is %E7%B6%AD, but if you pass it through the connection, you get back a file that says "see" ("ç", to be precise).

As it turns out, ç is 0x00E7 in UTF-16 (its UTF-8 percent-encoding is %C3%A7). I have no idea why this happens in Java, because appending the same percent-encoded text to the end of the link in any browser works correctly. So far I have tried various combinations to get the TTS service to read %E7%B6%AD, without much success.
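The byte-level observation above can be reproduced offline. The following is only a sketch of the suspected misreading, not a statement about what Google's server actually does: it assumes the receiver treats each UTF-8 byte as a separate single-byte character (ISO-8859-1 is used here as a stand-in, since code points 0x00 to 0xFF are the same in Latin-1 and UTF-16). The class and method names are made up for illustration, and the characters are written as Unicode escapes (U+7DAD is 維, U+00E7 is ç).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question.
class EncodingDemo {

    // Percent-encode text as UTF-8, the same call the question uses.
    static String percentEncodeUtf8(String text) throws UnsupportedEncodingException {
        return URLEncoder.encode(text, "UTF-8");
    }

    // Assumption: simulate a receiver that decodes every UTF-8 byte
    // as one single-byte character instead of decoding UTF-8.
    static String misreadAsSingleBytes(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) throws Exception {
        String wei = "\u7DAD"; // the character from the question (wei2)

        // Three UTF-8 bytes: E7 B6 AD
        System.out.println(percentEncodeUtf8(wei)); // %E7%B6%AD

        // Misread byte-by-byte, the first byte 0xE7 becomes U+00E7 (c with cedilla)
        String misread = misreadAsSingleBytes(wei);
        System.out.println(misread.length());                // 3
        System.out.println(misread.charAt(0) == '\u00E7');   // true

        // Correctly encoding U+00E7 itself as UTF-8 gives a different sequence
        System.out.println(percentEncodeUtf8("\u00E7")); // %C3%A7
    }
}
```

If the first byte of the sequence is read on its own, the service would see ç rather than 維, which matches the audio described above.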

EDIT2

A solution to my problem has been found! See the answer below. The problem was not the encoding; it was the parsing on Google's end. I have edited the title accordingly. Hooray!

1 answer

So, as it turned out, the problem was not the encoding at all; it was the processing on Google's end. For the service to recognize UTF-8 correctly, you need to use the link http://www.translate.google.com/translate_tts?ie=utf-8&tl=zh-cn&q= instead of the one above. Note the ie=utf-8 parameter added to the query string. With that in place, you can simply call URLEncoder.encode("你好嗎", "UTF-8"), append the result to the link, and send the request as usual. Phew!
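Put together, the fix might look like the sketch below. It uses only the base link and the ie=utf-8 parameter given in this answer; the class name, method names, and the choice to throw on a non-200 response (instead of reading the error stream, as the question's code does) are assumptions made for illustration. The response is read with plain java.io rather than commons-io so the sketch has no external dependencies, and whether the endpoint still answers such requests is not something it checks.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical helper class, not part of the original answer.
class TtsRequest {

    // Base link from the answer, including the ie=utf-8 parameter.
    static final String TTS_BASE =
            "http://www.translate.google.com/translate_tts?ie=utf-8&tl=";

    // Build the request URL: language code first, then the UTF-8 percent-encoded text.
    static String buildTtsUrl(String text, String languageCode) throws IOException {
        return TTS_BASE + languageCode + "&q=" + URLEncoder.encode(text, "UTF-8");
    }

    // Fetch the audio bytes; fails loudly instead of returning an error body.
    static byte[] fetchAudio(String text, String languageCode) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(buildTtsUrl(text, languageCode)).openConnection();
        try {
            connection.setRequestMethod("GET");
            int status = connection.getResponseCode();
            if (status != HttpURLConnection.HTTP_OK) {
                throw new IOException("Unexpected HTTP status: " + status);
            }
            try (InputStream in = connection.getInputStream();
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buffer = new byte[4096];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                return out.toByteArray();
            }
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // "\u4F60\u597D\u55CE" is the phrase from the answer, written as escapes.
        System.out.println(buildTtsUrl("\u4F60\u597D\u55CE", "zh-cn"));
    }
}
```

On Android, fetchAudio would need to run off the main thread, since network calls on the UI thread are rejected by the platform.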


Source: https://habr.com/ru/post/981597/

