I am trying to use the boilerpipe Java library to extract news articles from a collection of websites. It works well for English text, but for text with special characters, such as accented words (história), those characters are not extracted correctly. I think this is an encoding problem.
The boilerpipe FAQ says: "If you are extracting non-English text, you may need to change some parameters," and then refers to a paper. I did not find a solution in that paper.
My question is: does boilerpipe have any parameter where I can specify the encoding? If not, is there a workaround that gets the text extracted correctly?
This is how I use the library. First attempt, based on the URL:
URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);
Second attempt, based on the HTML source code:
String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);
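One workaround to consider for the second attempt is to download the page as raw bytes and decode it with an explicit charset before handing the string to the extractor. The sketch below is only an illustration under assumptions: the class name CharsetDecodeSketch and the helpers readAll, decode, and fetch are hypothetical, and it assumes the page's real charset (for example UTF-8) is already known. It uses only the standard library and does not call boilerpipe itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of boilerpipe.
public class CharsetDecodeSketch {

    // Reads every byte from the stream without interpreting it as text.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // Decodes raw page bytes with an explicitly chosen charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Downloads a page and decodes it with the given charset.
    // Assumption: the caller already knows which charset the page uses.
    static String fetch(String link, String charsetName) throws IOException {
        try (InputStream in = new URL(link).openStream()) {
            return decode(readAll(in), charsetName);
        }
    }

    public static void main(String[] args) {
        // In-memory demonstration, no network access needed.
        byte[] utf8 = "hist\u00f3ria".getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset the bytes were written in keeps the accent.
        System.out.println(decode(utf8, "UTF-8"));

        // Decoding the same bytes with the wrong charset garbles the accent.
        System.out.println(decode(utf8, "ISO-8859-1"));

        // With a real page, the decoded string would go to the String overload
        // used in the question:
        // String html = fetch(link, "UTF-8");
        // String article = ArticleExtractor.INSTANCE.getText(html);
    }
}
```

If the accents are already wrong in html_page_as_string before it reaches getText, the problem lies in how the page was read rather than in the extraction step, so it is worth checking that first.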