Using Boilerpipe to extract non-English articles

I am trying to use the Boilerpipe Java library to retrieve news articles from a collection of websites. It works great for English text, but for text with special characters, such as accented words (história), those characters are not extracted correctly. I think this is an encoding problem.

The Boilerpipe FAQ says: “If you are extracting non-English text, you may need to change some parameters,” and then refers to a paper. I did not find a solution in that paper.

My question is: is there a parameter in Boilerpipe where I can specify the encoding? Is there a workaround to get the text extracted correctly?

How I use the library (first try, based on the URL):

    URL url = new URL(link);
    String article = ArticleExtractor.INSTANCE.getText(url);

(second try, based on the HTML source):

 String article = ArticleExtractor.INSTANCE.getText(html_page_as_string); 
6 answers

Well, here is the solution. As Andrei said, I had to change the HTMLFetcher class, which is in the package de.l3s.boilerpipe.sax, so that all fetched text is converted to UTF-8. At the end of the fetch function, I added two lines and changed the last one:

    final byte[] data = bos.toByteArray(); // stays the same
    byte[] utf8 = new String(data, cs.displayName()).getBytes("UTF-8"); // new line (conversion)
    cs = Charset.forName("UTF-8"); // set the charset to UTF-8
    return new HTMLDocument(utf8, cs); // edited line
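The conversion step in that patch can be tried on its own, outside of Boilerpipe. The following is only a minimal sketch using the JDK; the class name Utf8Transcoder and the method toUtf8 are hypothetical names for illustration and are not part of the library.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Utf8Transcoder {

    /** Decodes the bytes with the given source charset and re-encodes them as UTF-8. */
    static byte[] toUtf8(byte[] data, Charset source) {
        return new String(data, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "história" encoded as ISO-8859-1: one byte per character.
        byte[] latin1 = "hist\u00f3ria".getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        // The accented letter takes two bytes in UTF-8, so the array grows by one.
        System.out.println(latin1.length + " -> " + utf8.length); // prints "8 -> 9"
    }
}
```

The conversion is only correct when the source charset really matches the bytes; if the page was detected with the wrong charset, the result is still garbled, just garbled in UTF-8.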

You do not need to modify Boilerpipe inner classes.

Just pass an InputSource object to the ArticleExtractor.INSTANCE.getText() method and force the encoding on that object. For instance:

    URL url = new URL("http://some-page-with-utf8-encodeing.tld");
    InputSource is = new InputSource();
    is.setEncoding("UTF-8");
    is.setByteStream(url.openStream());
    String text = ArticleExtractor.INSTANCE.getText(is);
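If the encoding is not known in advance, one option is to read it from the HTTP Content-Type header and only fall back to a fixed value when the header does not name a usable charset. This is a rough sketch under that assumption; the class EncodedSource, its methods, and the regular expression are hypothetical and not part of Boilerpipe. Only the JDK's org.xml.sax.InputSource is used.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.xml.sax.InputSource;

final class EncodedSource {

    // Hypothetical pattern: captures the charset name in a Content-Type value.
    private static final Pattern CHARSET =
            Pattern.compile("charset=\"?([A-Za-z0-9][\\w.:+-]*)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the header if the JVM supports it, else the fallback. */
    static String charsetOf(String contentType, String fallback) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find() && Charset.isSupported(m.group(1))) {
                return m.group(1);
            }
        }
        return fallback;
    }

    /** Wraps the stream in an InputSource with the encoding forced explicitly. */
    static InputSource open(InputStream in, String contentType) {
        InputSource source = new InputSource(in);
        source.setEncoding(charsetOf(contentType, "UTF-8"));
        return source;
    }

    public static void main(String[] args) {
        byte[] page = "<p>hist\u00f3ria</p>".getBytes(StandardCharsets.ISO_8859_1);
        InputSource source =
                open(new ByteArrayInputStream(page), "text/html; charset=ISO-8859-1");
        System.out.println(source.getEncoding());
        // The source could then be passed on as in the snippet above:
        // String text = ArticleExtractor.INSTANCE.getText(source);
    }
}
```

With a real page, the stream and header would come from a URLConnection (getInputStream() and getContentType()). Pages that declare their charset only in a meta tag are not covered by this sketch.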

Hello!


Well, from what I see, when you use it like this, the library will automatically choose which encoding to use. From HTMLFetcher source:

    public static HTMLDocument fetch(final URL url) throws IOException {
        final URLConnection conn = url.openConnection();
        final String ct = conn.getContentType();

        Charset cs = Charset.forName("Cp1252");
        if (ct != null) {
            Matcher m = PAT_CHARSET.matcher(ct);
            if (m.find()) {
                final String charset = m.group(1);
                try {
                    cs = Charset.forName(charset);
                } catch (UnsupportedCharsetException e) {
                    // keep default
                }
            }
        }
        // ...
    }

So the charset comes from the Content-Type header, and Cp1252 is used when the header does not name one. Try debugging your code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding.
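To see why a Cp1252 default would produce the symptom from the question, here is a small self-contained sketch. It assumes the page is actually UTF-8 while being decoded as windows-1252 (the canonical name of Cp1252); the class MojibakeDemo and its methods are hypothetical illustration names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    /** Encodes the text as UTF-8, then decodes those bytes with the wrong charset. */
    static String decodeAs(String text, Charset wrong) {
        return new String(text.getBytes(StandardCharsets.UTF_8), wrong);
    }

    /** Reverses the damage, provided every garbled character maps back to a byte. */
    static String repair(String garbled, Charset wrong) {
        return new String(garbled.getBytes(wrong), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // The two UTF-8 bytes of the accented letter become two separate characters.
        String garbled = decodeAs("hist\u00f3ria", cp1252);
        System.out.println(garbled.length());                              // 9 instead of 8
        System.out.println(repair(garbled, cp1252).equals("hist\u00f3ria")); // true
    }
}
```

The garbled string is "hist" followed by U+00C3 and U+00B3 and then "ria", which is the typical look of UTF-8 read as a single-byte Western encoding.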


Boilerpipe's ArticleExtractor uses some algorithms specially adapted to English, such as measuring the average number of words per phrase. In any language that is more or less verbose than English (i.e., every other language), these algorithms will be less accurate.

In addition, the library uses some English phrases to try to find the end of the article (comments, commentary, your opinion, etc.), which obviously will not work in other languages.

This does not mean that the library will fail completely. Just be aware that some changes are probably necessary for good results in non-English languages.


Java:

    import java.net.URL;

    import org.xml.sax.InputSource;

    import de.l3s.boilerpipe.extractors.ArticleExtractor;

    public class Boilerpipe {

        public static void main(String[] args) {
            try {
                URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");

                InputSource is = new InputSource();
                is.setEncoding("UTF-8");
                is.setByteStream(url.openStream());

                String text = ArticleExtractor.INSTANCE.getText(is);
                System.out.println(text);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Eclipse: Run > Run Configurations > Common tab. Set the encoding to Other (UTF-8), then click Run.



I had the same problem; cnr's solution works fine. Just change the encoding from UTF-8 to ISO-8859-1. Thanks.

    URL url = new URL("http://some-page-with-utf8-encodeing.tld");
    InputSource is = new InputSource();
    is.setEncoding("ISO-8859-1");
    is.setByteStream(url.openStream());
    String text = ArticleExtractor.INSTANCE.getText(is);
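Since some sites need UTF-8 and others ISO-8859-1, one possible way to choose between the two is to test whether the downloaded bytes are valid UTF-8 and fall back to ISO-8859-1 otherwise. This is a heuristic sketch, not something Boilerpipe does itself; the class EncodingGuess and the method guess are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class EncodingGuess {

    /** Returns "UTF-8" if the bytes decode strictly as UTF-8, otherwise "ISO-8859-1". */
    static String guess(byte[] data) {
        try {
            // REPORT makes the decoder throw instead of silently substituting characters.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return "ISO-8859-1";
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "hist\u00f3ria".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "hist\u00f3ria".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(utf8));   // UTF-8
        System.out.println(guess(latin1)); // ISO-8859-1
    }
}
```

The result could be passed to is.setEncoding(...) after reading the page into a byte array and wrapping it in a ByteArrayInputStream. Pure ASCII input is reported as UTF-8, which is harmless, but the check cannot tell ISO-8859-1 apart from other single-byte encodings.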

Source: https://habr.com/ru/post/908331/

