I collect data from RSS feeds, sanitize it, and save it in a database. I use Java, Tidy, MySQL, and JDBC.
Steps:
- I fetch the RSS feeds. This part works normally.
- I sanitize the HTML with Tidy. Here is one transformation: Tidy automatically converts strings like "So it&#8217;s unlikely" to "So it’s unlikely".
- I save the result as a row in a table.
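To make the transformation in the second step concrete, here is a minimal sketch of what decoding the numeric character reference 8217 amounts to. It is not Tidy's implementation; the class `EntityDecodeSketch` and the method `decodeDecimalEntities` are hypothetical names used only for illustration, and the sketch handles decimal references only, without validating the code point.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; this is NOT how Tidy is implemented.
final class EntityDecodeSketch {
    // Matches decimal numeric character references such as &#8217;
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d+);");

    // Replaces each decimal reference with the character it denotes.
    // No validation: an out-of-range number would throw an exception.
    static String decodeDecimalEntities(String input) {
        Matcher m = DECIMAL_ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String replacement = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decodeDecimalEntities("So it&#8217;s unlikely");
        // The character after "So it" is U+2019 (right single quotation mark).
        System.out.println(Integer.toHexString(decoded.codePointAt(5))); // prints 2019
    }
}
```

The point of the sketch is that after sanitizing, the string contains a real non-ASCII character (U+2019) rather than an ASCII-only entity, so every later step has to handle it with a charset that can represent it.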
MySQL schema
CREATE TABLE IF NOT EXISTS `rss_item_safe_texts` (
  `id` int(10) unsigned NOT NULL,
  `title` varchar(1000) NOT NULL,
  `link` varchar(255) NOT NULL,
  `description` mediumtext NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
JDBC Connection URL
connUrl = "jdbc:mysql://" + host + "/" + database + "?user=" + username + "&password=" + password + "&useUnicode=true&characterEncoding=UTF-8";
Java code
PreparedStatement updateSafeTextSt = conn.prepareStatement(
    "UPDATE `rss_item_safe_texts` SET `title` = ?, `link` = ?, `description` = ? WHERE `id` = ?");
updateSafeTextSt.setString(1, EscapingUtils.escapeXssInjection(title));
updateSafeTextSt.setString(2, link);
updateSafeTextSt.setString(3, EscapingUtils.escapeXssInjection(description));
updateSafeTextSt.setInt(4, itemId);
updateSafeTextSt.execute();
updateSafeTextSt.close();
As a result, I see broken characters in the database, such as "So it?s unlikely". I see the same thing when the text is then output on a web page (a UTF-8 page).
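A "?" in place of a typographic apostrophe is the usual sign that the text was encoded at some point with a charset that cannot represent U+2019. Which step does this here is not established by anything above; the following is only a minimal sketch that reproduces the symptom in plain Java, using ISO-8859-1 as an assumed example of such a charset. The class `LossyEncodingSketch` and the method `roundTrip` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic sketch; it does not touch the database or Tidy.
final class LossyEncodingSketch {
    // Encodes the text to bytes with the given charset, then decodes it back.
    // Characters the charset cannot represent are replaced during encoding.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String text = "So it\u2019s unlikely";

        // ISO-8859-1 has no U+2019, so String.getBytes substitutes '?'.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1)); // So it?s unlikely

        // UTF-8 can represent U+2019 (as three bytes), so the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

Running the same kind of round trip at each stage (after fetching the feed, after Tidy, just before `setString`) can help narrow down where the character is lost, since the value either still contains U+2019 at that point or already contains "?".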