First of all, I use:
Java 1.7.0_02
MySQL 5.1.50
ZendServer CE (if that matters)
The JDBC driver that I use to connect to MySQL from Java is com.mysql.jdbc.Driver. The database connection itself works fine.
My connection string:
jdbc:mysql:
While trying to solve the problem described below, I added
?useUnicode=true&characterEncoding=UTF-8
to the connection string.
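For reference, a minimal sketch of how those parameters end up on the URL. The host, port and database name here (localhost, 3306, wikidb) are placeholders, not my real connection string, and the helper name withUtf8Params is made up for illustration.

```java
class JdbcUrlSketch {
    // Appends the two encoding parameters, using '?' for the first
    // parameter on the URL and '&' if the URL already has a query part.
    static String withUtf8Params(String baseUrl) {
        String separator = baseUrl.contains("?") ? "&" : "?";
        return baseUrl + separator + "useUnicode=true&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        // Placeholder URL (assumption), only to show the resulting shape.
        String url = withUtf8Params("jdbc:mysql://localhost:3306/wikidb");
        System.out.println(url);
    }
}
```

The resulting string is what would be passed to DriverManager.getConnection(url, user, password).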
I work with the Wikipedia dump, where all the text is in MediaWiki format, and I parse the content with JWPL, which works great for me. However, somewhere in the process of pulling the text out of the database, parsing it and displaying it as HTML, I lose the “-” characters and single quotes: for example, I get Earth s instead of Earth's.
After some testing, I concluded that the characters are not being encoded/decoded properly somewhere between the MySQL query and the String handling in Java. I came to this conclusion because the text in the database (stored as MEDIUMBLOB) has the correct characters, as it should, while printing the string directly in Java right after the DB call already shows broken or missing characters ("?????" instead of Japanese characters, etc.).
I confirmed that System.getProperty("file.encoding") returns UTF-8, so the JVM should encode Strings correctly when printing them (unless something goes wrong in the JVM's UTF-8 > UTF-16 > UTF-8 conversion).
I also created a UTF-8 table with UTF-8 columns and copied the data into it for testing, which did not solve anything. Another attempted fix was to replace:
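A small sketch of the kind of check I mean, assuming the console could still be a separate source of "?" output even when file.encoding is UTF-8. The class and method names are made up for illustration. It prints the JVM defaults and encodes a string through a PrintStream with an explicit UTF-8 charset, so the bytes can be inspected independently of the console.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class OutputEncodingCheck {
    // Writes the text through a PrintStream that is forced to UTF-8
    // and returns the bytes that would be sent to the output.
    static byte[] printAsUtf8(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(buffer, true, "UTF-8");
            out.print(text);
            out.flush();
            return buffer.toByteArray();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        // Two Japanese characters (U+65E5, U+672C), three UTF-8 bytes each.
        byte[] bytes = printAsUtf8("\u65e5\u672c");
        System.out.println("encoded length  = " + bytes.length); // prints 6
    }
}
```

If the byte count is right but the console still shows "?", the loss would be on the display side rather than in the String itself.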
return result.getString("old_text");
which pulls the text from the result set, with:
return new String(result.getString("old_text").getBytes("utf8"), "utf8");
which gave me the same results as the previous statement.
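A minimal sketch of why that second statement cannot change anything, plus an alternative under the assumption that the BLOB really holds UTF-8 bytes. Encoding a String to UTF-8 and decoding it again returns an equal String, so characters that were already replaced by the time getString returned stay lost. Decoding the raw bytes with an explicit charset avoids relying on whatever conversion happened before that. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // String -> UTF-8 bytes -> String gives back an equal String,
    // so it cannot restore characters that are already damaged.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Decodes raw bytes (for example, the content of a BLOB column)
    // as UTF-8. Assumes the bytes really are UTF-8.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String damaged = "Earth?s";
        System.out.println(roundTrip(damaged).equals(damaged)); // prints true

        // Simulated column bytes containing a right single quote (U+2019).
        byte[] raw = "Earth\u2019s".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeUtf8(raw).equals("Earth\u2019s")); // prints true
    }
}
```

With JDBC, the raw bytes would come from result.getBytes("old_text") instead of result.getString("old_text"). Whether that helps here depends on what the column actually contains.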
Is there any way to avoid this loss of character data when accessing MySQL through JDBC? If not, is there a way to post-process the text and restore the correct characters for display? Blocks of two or three random characters in place of standard punctuation disrupt the user's reading.
EDIT
A quick note: the data in the database is fine. The characters are present and all of them are visible. Accessing the data through phpMyAdmin returns correctly encoded characters. The problem arises somewhere between MySQL and Java, possibly in JDBC. I am looking for settings or a workaround (one that actually works, since the ones I tried did not) that will prevent the loss of these characters.