How to convert a string of Russian Cyrillic letters?

I am parsing mp3 tags.

I have a String artist, and I don't know which encoding the tag was written in.

Ïåñíÿ ïðî íàäåæäó - an example of how a Russian string ("Песня про надежду") comes out

I am using http://code.google.com/p/juniversalchardet/

the code:

    String GetEncoding(String text) throws IOException {
        byte[] buf = new byte[4096];
        InputStream fis = new ByteArrayInputStream(text.getBytes());
        UniversalDetector detector = new UniversalDetector(null);
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        detector.reset();
        return encoding;
    }

And then I convert:

new String(text.getBytes(encoding), "cp1251"); but it does not work.

If I use UTF-16:

new String(text.getBytes("UTF-16"), "cp1251") returns "ya

EDIT:

This is where the bytes are first read:

    byte[] abyFrameData = new byte[iTagSize];
    oID3DIS.readFully(abyFrameData);
    ByteArrayInputStream oFrameBAIS = new ByteArrayInputStream(abyFrameData);

String s = new String (abyFrameData, "????");

2 answers

Java strings are UTF-16. Data in any other encoding has to be held as a byte sequence. To decode character data correctly, you must supply the encoding at the moment you first create the String from those bytes. If you already have a corrupted String, it is too late.
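To illustrate the "too late" point, here is a minimal sketch. It assumes the tag bytes are windows-1251 (as the example in the question suggests) and that the JRE provides that charset, which standard JDK builds do. The class name DecodeAtSource and the helper decode are made up for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DecodeAtSource {
    // Assumption: windows-1251 is available in this JRE.
    static final Charset CP1251 = Charset.forName("windows-1251");

    // Hypothetical helper: decode raw bytes with an explicitly named charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The word "Песня" encoded as windows-1251.
        byte[] raw = { (byte) 0xCF, (byte) 0xE5, (byte) 0xF1, (byte) 0xED, (byte) 0xFF };

        // Correct charset supplied at creation time: the text is intact.
        String right = decode(raw, CP1251);

        // Wrong charset: these bytes are malformed UTF-8, so the decoder
        // substitutes U+FFFD and the original byte values are lost.
        String lossy = decode(raw, StandardCharsets.UTF_8);

        System.out.println(right);
        System.out.println(lossy.contains("\uFFFD")); // true
        // Re-encoding the damaged string does not give the original bytes back.
        System.out.println(Arrays.equals(raw, lossy.getBytes(StandardCharsets.UTF_8))); // false
    }
}
```

Not every wrong decode is destructive, but nothing guarantees a reversible one, which is why the charset belongs at the point where the bytes are read.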

Assuming ID3, the specifications define the encoding rules. For example, ID3v2.4.0 can restrict the permitted encodings through the extended header:

q - Text encoding restrictions

    0    No restrictions
    1    Strings are only encoded with ISO-8859-1 [ISO-8859-1] or
         UTF-8 [UTF-8].

Encoding handling is defined further on in the document:

If nothing else is said, strings, including numeric strings and URLs, are represented as ISO-8859-1 characters in the range $20 - $FF. Such strings are represented in frame descriptions as <text string>, or <full text string> if newlines are allowed. If nothing else is said, the newline character is forbidden. In ISO-8859-1 a newline is represented, when allowed, with $0A only.

Frames that allow different types of text encoding contain a text encoding description byte. Possible encodings:

    $00   ISO-8859-1 [ISO-8859-1]. Terminated with $00.
    $01   UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All
          strings in the same frame SHALL have the same byteorder.
          Terminated with $00 00.
    $02   UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM.
          Terminated with $00 00.
    $03   UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with $00.

Use transcoding classes such as InputStreamReader or (more likely in this case) the String(byte[],Charset) constructor to decode the data. See also Java: a rough guide to character encoding.
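As a rough sketch of the two options just mentioned, the following decodes the same bytes once through an InputStreamReader and once through the String constructor. The class StreamDecoding and the method readAll are hypothetical names, and the sample bytes are invented for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class StreamDecoding {
    // Hypothetical helper: read a whole stream as text in the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] chunk = new char[1024];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                out.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+041F U+0435 ("Пе") as UTF-16BE without a BOM.
        byte[] utf16be = { 0x04, 0x1F, 0x04, 0x35 };

        // Streaming decode, useful when the data is large or arrives in pieces.
        String viaReader = readAll(new ByteArrayInputStream(utf16be), StandardCharsets.UTF_16BE);

        // One-shot decode, convenient when a whole frame is already in memory.
        String viaConstructor = new String(utf16be, StandardCharsets.UTF_16BE);

        System.out.println(viaReader.equals(viaConstructor)); // true
    }
}
```

For a tag frame that has already been read into a byte array, the constructor is usually the simpler choice.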


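Parsing the string components of an ID3v2.4.0 data structure would look something like this:

    // untested code
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.util.Arrays;

    public String parseID3String(DataInputStream in) throws IOException {
        // Index = value of the text encoding description byte ($00..$03)
        String[] encodings = { "ISO-8859-1", "UTF-16", "UTF-16BE", "UTF-8" };
        String encoding = encodings[in.read()];
        // UTF-16 variants are terminated with $00 00, the others with $00
        byte[] terminator = encoding.startsWith("UTF-16") ? new byte[2] : new byte[1];
        byte[] buf = terminator.clone();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        while (true) {
            in.readFully(buf);
            if (Arrays.equals(terminator, buf)) {
                break; // stop without copying the terminator into the result
            }
            buffer.write(buf);
        }
        return new String(buffer.toByteArray(), encoding);
    }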

This works for me:

    byte[] bytes = s.getBytes("ISO-8859-1");
    UniversalDetector encDetector = new UniversalDetector(null);
    encDetector.handleData(bytes, 0, bytes.length);
    encDetector.dataEnd();
    String encoding = encDetector.getDetectedCharset();
    if (encoding != null) s = new String(bytes, encoding);
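The reason this round trip can work is that ISO-8859-1 maps every byte value 0x00 to 0xFF onto the code point with the same number, so a string that was decoded as ISO-8859-1 still carries the original bytes. Below is a minimal sketch of that step without the detector, assuming the real encoding is already known to be windows-1251 and that the JRE provides it. The class MojibakeRepair and the method redecode are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeRepair {
    // Hypothetical helper. Assumes 'garbled' was produced by decoding the
    // original bytes as ISO-8859-1, so encoding it back is lossless.
    static String redecode(String garbled, Charset actual) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, actual);
    }

    public static void main(String[] args) {
        // The garbled string from the question, "Ïåñíÿ ïðî íàäåæäó",
        // written with escapes so the source file encoding does not matter.
        String garbled = "\u00CF\u00E5\u00F1\u00ED\u00FF \u00EF\u00F0\u00EE "
                + "\u00ED\u00E0\u00E4\u00E5\u00E6\u00E4\u00F3";

        // Assumption: the tag was written in windows-1251.
        String fixed = redecode(garbled, Charset.forName("windows-1251"));

        // Expected text "Песня про надежду", also written with escapes.
        String expected = "\u041F\u0435\u0441\u043D\u044F \u043F\u0440\u043E "
                + "\u043D\u0430\u0434\u0435\u0436\u0434\u0443";
        System.out.println(fixed.equals(expected)); // true
    }
}
```

If the string was originally decoded with some other charset (for example windows-1252, which differs from ISO-8859-1 in the 0x80 to 0x9F range), the first step has to use that charset instead, and it may not be lossless.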

Source: https://habr.com/ru/post/888215/

