How to avoid punctuation loss when retrieving data from a MySQL database using JDBC?

First of all, I use:

Java 1.7.0_02
MySQL 5.1.50
Zend Server CE (if that matters)

The JDBC driver that I use to connect to MySQL from Java is com.mysql.jdbc.Driver. The database connection itself works fine.

My connection string:

 jdbc:mysql://localhost:3306/table 

While trying to solve the problem, I added

 ?useUnicode=true&characterEncoding=UTF-8 

to the connection string.
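Those two properties are passed directly on the Connector/J URL. A minimal sketch of the connection setup, where the database name and credentials are placeholders for your own values:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class Utf8Connect {
    // The encoding hints are appended as URL query parameters;
    // "wikidb", "user", and "password" are placeholder values.
    static final String URL =
        "jdbc:mysql://localhost:3306/wikidb"
        + "?useUnicode=true&characterEncoding=UTF-8";

    public static Connection open() throws SQLException {
        return DriverManager.getConnection(URL, "user", "password");
    }
}
```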

I work with the Wikipedia dump; all the text is in MediaWiki markup, and I parse the content with JWPL, which works great for me. But somewhere in the process of pulling the text out of the database, analyzing it, and displaying it as HTML, I lose the "-" characters and single quotes, so I get Earth s instead of Earth's .

After some testing, I concluded that the characters are not being encoded/decoded properly somewhere between the MySQL query and the String processing in Java. I came to this conclusion because the text in the database (stored as MEDIUMBLOB ) has the correct characters, as it should, while printing the string in Java directly after the DB call shows broken/missing characters ("?????" instead of Japanese characters, and so on).

I confirmed that System.getProperty("file.encoding") returns UTF-8, so the JVM should be encoding Strings correctly when they are printed (unless something goes wrong in the JVM's UTF-8 → UTF-16 → UTF-8 conversion).

I also created a UTF-8 table with UTF-8 columns and moved the data into it for testing, which did not solve anything. Another attempted fix was replacing:

 return result.getString("old_text"); 

which pulls the text from the result set, with:

 return new String(result.getString("old_text").getBytes("utf8"), "utf8"); 

which gave the same result as the previous statement.
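That outcome is expected: encoding a String to UTF-8 and immediately decoding it back is an identity operation, so it cannot repair characters that were already mangled earlier in the pipeline. A quick demonstration:

```java
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) {
        // \u2013 is an en dash, a typical character that gets lost.
        String s = "Earth's \u2013 dash";
        // Encoding to UTF-8 bytes and decoding them back yields the
        // same string, so this cannot undo earlier corruption.
        String roundTripped = new String(
            s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
        System.out.println(s.equals(roundTripped)); // prints "true"
    }
}
```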

Is there any way to avoid this loss of character data when accessing MySQL from JDBC? If not, is there a way to process the characters and restore the correct ones for display? Blocks of two or three garbage characters in place of ordinary punctuation are disruptive for the user.

EDIT

A short note: the data in the database is fine; the characters are present and all of them display correctly. Accessing the data through phpMyAdmin returns it with correctly encoded characters. The problem arises somewhere between MySQL and Java, most likely in JDBC. I am looking for a setting or a workaround (one that works, since the ones I tried did not) that will prevent the loss of these characters.

2 answers

After some research and reading, I found a solution that fixed the problem. I cannot say exactly why, but it seems to have been the conversion of the MEDIUMBLOB to a Java String.

This is how I originally returned the text from the result set:

 if (result.next())
     return result.getString("old_text");
 else
     return null;

I had not done much with JDBC before and did not know that the Blob class existed, so I changed the code to:

 if (result.next()) {
     Blob blob = result.getBlob("old_text");
     InputStream is = blob.getBinaryStream();
     byte[] bytes = new byte[is.available()];
     is.read(bytes);
     is.close();
     return new String(bytes, "UTF-8");
 } else {
     return null;
 }

And it works great.
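One caveat with the code above: InputStream.available() is only an estimate and may under-report the remaining bytes for large values. A variant of the same idea, sketched here as an assumption rather than the author's exact code, reads the bytes through Blob.getBytes instead:

```java
import java.nio.charset.StandardCharsets;
import java.sql.Blob;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BlobText {
    // Reads the "old_text" column as raw bytes and decodes them
    // explicitly as UTF-8. Blob.getBytes uses a 1-based offset and
    // avoids relying on InputStream.available().
    public static String readOldText(ResultSet result) throws SQLException {
        if (!result.next()) {
            return null;
        }
        Blob blob = result.getBlob("old_text");
        return decode(blob.getBytes(1, (int) blob.length()));
    }

    // Explicit UTF-8 decode, independent of the platform default charset.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```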


I think the problem lies in how you encode and decode the bytes in the Blob; most likely the default encoding is not what you think it is.

I would recommend getting and putting byte arrays, and explicitly specifying the UTF-8 encoding when converting strings to byte arrays and back. Do not rely on assumptions about the default encoding.

FWIW, the right way to find out the JVM's default encoding is to look at the object returned by Charset.defaultCharset() .
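A minimal check along those lines, printing what the JVM actually uses rather than the file.encoding property captured at startup:

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // Charset.defaultCharset() reflects the charset the JVM
        // actually uses for byte/char conversions by default.
        System.out.println(Charset.defaultCharset().name());
    }
}
```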

0
source

Source: https://habr.com/ru/post/1388534/
