How to convert a string of Russian Cyrillic letters?

Question

I am parsing MP3 tags.

I have a String artist, and I don't know what encoding it was stored in.

Ïåñíÿ ïðî íàäåæäó is an example of how a Russian line comes out (it should read "Песня про надежду").

I am using http://code.google.com/p/juniversalchardet/

the code:

    String GetEncoding(String text) throws IOException {
        byte[] buf = new byte[4096];
        InputStream fis = new ByteArrayInputStream(text.getBytes());
        UniversalDetector detector = new UniversalDetector(null);
        int nread;
        while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
            detector.handleData(buf, 0, nread);
        }
        detector.dataEnd();
        String encoding = detector.getDetectedCharset();
        detector.reset();
        return encoding;
    }

And then I convert it:

    new String(text.getBytes(encoding), "cp1251");

but it does not work.

If I use UTF-16,

    new String(text.getBytes("UTF-16"), "cp1251")

returns "ya

EDIT:

This is where the bytes are first read:

    byte[] abyFrameData = new byte[iTagSize];
    oID3DIS.readFully(abyFrameData);
    ByteArrayInputStream oFrameBAIS = new ByteArrayInputStream(abyFrameData);

    String s = new String(abyFrameData, "????");

+6

java encoding

Mediator May 16 '11 at 11:59

2 answers

This works for me:

    byte[] bytes = s.getBytes("ISO-8859-1");
    UniversalDetector encDetector = new UniversalDetector(null);
    encDetector.handleData(bytes, 0, bytes.length);
    encDetector.dataEnd();
    String encoding = encDetector.getDetectedCharset();
    if (encoding != null)
        s = new String(bytes, encoding);
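The round trip through ISO-8859-1 in the snippet above works because that charset maps each of the 256 byte values to exactly one character, so getBytes("ISO-8859-1") returns the original bytes, provided the string was produced by decoding as ISO-8859-1 in the first place. Here is a minimal sketch applied to the question's example string. It skips detection and simply assumes the tag bytes were windows-1251; the class name MojibakeRepair and the method repair are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Assumption: the tag bytes were really windows-1251 (cp1251).
    private static final Charset CP1251 = Charset.forName("windows-1251");

    /** Undo a wrong ISO-8859-1 decode and decode the same bytes as cp1251. */
    static String repair(String garbled) {
        // ISO-8859-1 encodes chars U+0000..U+00FF back to bytes 0x00..0xFF unchanged.
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, CP1251);
    }

    public static void main(String[] args) {
        // The garbled example from the question, written with Unicode
        // escapes so the source file encoding does not matter.
        String garbled = "\u00CF\u00E5\u00F1\u00ED\u00FF \u00EF\u00F0\u00EE "
                + "\u00ED\u00E0\u00E4\u00E5\u00E6\u00E4\u00F3";
        System.out.println(repair(garbled));
    }
}
```

On a console that can display Cyrillic this should print "Песня про надежду". If the string had been decoded with a charset that is not one-to-one for these bytes, the original bytes could not be recovered this way.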

0

Nik May 07 '14 at 6:11

Mcdowell · Accepted Answer · 2011-05-16T13:03:50+0000

Java strings are UTF-16. Data in any other encoding has to be held as a byte sequence. To decode character data, you must supply the encoding when you first create the string. If you already have a corrupted string, it is too late.
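To make that concrete, here is a small sketch that decodes the same five bytes with different charsets. It assumes the bytes are windows-1251, which the question does not confirm; the class name DecodeDemo and the helper roundTrip are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class DecodeDemo {
    // Bytes of a five-letter Russian word in windows-1251 (assumed tag encoding).
    static final byte[] RAW = {
        (byte) 0xCF, (byte) 0xE5, (byte) 0xF1, (byte) 0xED, (byte) 0xFF
    };

    /** Decode the bytes with a charset, then encode back with the same charset. */
    static byte[] roundTrip(byte[] raw, Charset cs) {
        return new String(raw, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("windows-1251");
        // Correct charset supplied at construction: readable Cyrillic text.
        System.out.println(new String(RAW, cp1251));
        // Wrong but one-to-one charset: garbled text, yet the bytes survive.
        System.out.println(Arrays.equals(RAW, roundTrip(RAW, StandardCharsets.ISO_8859_1)));
        // Wrong and lossy charset: invalid UTF-8 turns into U+FFFD, bytes are lost.
        System.out.println(Arrays.equals(RAW, roundTrip(RAW, StandardCharsets.UTF_8)));
    }
}
```

The second line of output is true and the third is false: once the bytes have gone through a decoder that replaces what it cannot map, no later conversion of the String brings them back.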

Assuming ID3, the specifications define the encoding rules. For example, ID3v2.4.0 can restrict the encodings used via the extended header:

q - Text Encoding Limitations

  0 No restrictions
  1 Strings are only encoded with ISO-8859-1 [ISO-8859-1] or UTF-8 [UTF-8].

Encoding handling is defined further on in the document:

If nothing else is said, strings, including numeric strings and URLs, are represented as ISO-8859-1 characters in the range $20 - $FF. Such strings are represented in frame descriptions as <text string>, or <full text string> if newlines are allowed. If nothing else is said, newline characters are forbidden. In ISO-8859-1 a newline is represented, when allowed, with $0A only.
Frames that allow different types of text encoding contain a text encoding description byte. Possible encodings:
  $00 ISO-8859-1 [ISO-8859-1]. Terminated with $00.
  $01 UTF-16 [UTF-16] encoded Unicode [UNICODE] with BOM. All strings in the same frame SHALL have the same byteorder. Terminated with $00 00.
  $02 UTF-16BE [UTF-16] encoded Unicode [UNICODE] without BOM. Terminated with $00 00.
  $03 UTF-8 [UTF-8] encoded Unicode [UNICODE]. Terminated with $00.

Use transcoding classes such as InputStreamReader or (more likely in this case) the String(byte[],Charset) constructor to decode the data. See also: Java: An Approximate Guide to Character Encoding.

Parsing the string components of an ID3v2.4.0 data structure would look something like this:

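    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.IOException;
    import java.util.Arrays;

    //untested code
    public String parseID3String(DataInputStream in) throws IOException {
        String[] encodings = { "ISO-8859-1", "UTF-16", "UTF-16BE", "UTF-8" };
        // text encoding description byte; throws EOFException at end of stream
        int id = in.readUnsignedByte();
        if (id >= encodings.length) {
            throw new IOException("Unknown text encoding byte: " + id);
        }
        String encoding = encodings[id];
        // UTF-16 variants end with $00 00, the others with $00
        byte[] terminator = encoding.startsWith("UTF-16") ? new byte[2] : new byte[1];
        byte[] buf = terminator.clone();
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        while (true) {
            in.readFully(buf);
            if (Arrays.equals(terminator, buf)) {
                break; // keep the terminator out of the decoded string
            }
            buffer.write(buf);
        }
        return new String(buffer.toByteArray(), encoding);
    }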
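One possible explanation for the question's output, which nothing in the question confirms, is that the tagger wrote windows-1251 bytes but flagged the frame as $00 (ISO-8859-1). The following sketch decodes a text frame body that is already in a byte array, as in the question's EDIT, and lets the caller substitute another charset for frames flagged $00. The class FrameText, the method decode and the parameter legacyCharset are hypothetical names, and the override is a workaround of my own, not part of the ID3 specification.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class FrameText {
    // Index = ID3v2.4.0 text encoding description byte ($00..$03).
    private static final Charset[] ENCODINGS = {
        StandardCharsets.ISO_8859_1, StandardCharsets.UTF_16,
        StandardCharsets.UTF_16BE, StandardCharsets.UTF_8
    };

    /**
     * Decode a text frame body: one encoding byte followed by the text.
     * If legacyCharset is non-null it replaces ISO-8859-1 for $00 frames.
     */
    static String decode(byte[] body, Charset legacyCharset) {
        if (body.length == 0) {
            throw new IllegalArgumentException("empty frame body");
        }
        int id = body[0] & 0xFF;
        if (id >= ENCODINGS.length) {
            throw new IllegalArgumentException("unknown text encoding byte: " + id);
        }
        Charset cs = (id == 0 && legacyCharset != null) ? legacyCharset : ENCODINGS[id];
        String text = new String(body, 1, body.length - 1, cs);
        // Drop trailing terminators, which decode to U+0000.
        int end = text.length();
        while (end > 0 && text.charAt(end - 1) == '\0') {
            end--;
        }
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        // $00 flag, five windows-1251 bytes, $00 terminator.
        byte[] body = {0x00, (byte) 0xCF, (byte) 0xE5, (byte) 0xF1,
                       (byte) 0xED, (byte) 0xFF, 0x00};
        System.out.println(decode(body, null));
        System.out.println(decode(body, Charset.forName("windows-1251")));
    }
}
```

With no override the first call yields "Ïåñíÿ", the same kind of garbling shown in the question; with the windows-1251 override the second call yields "Песня". Whether such an override is appropriate depends on where the files came from, so a detector like juniversalchardet run on the raw bytes may be a better way to choose it.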
