Using boilerpipe to extract non-English articles

Question

I am trying to use the boilerpipe Java library to extract news articles from a set of websites. It works great for English text, but for text with special characters, such as accented words (história), those characters are not extracted correctly. I think this is an encoding problem.

The boilerpipe FAQ says: “If you are extracting non-English text, you may need to change some parameters,” and then refers to a paper. I did not find a solution in that paper.

My question is: is there a parameter in boilerpipe where I can specify the encoding? Or is there some other way to work around this and get the text correctly?

How I use the library (first attempt, from a URL):

    URL url = new URL(link);
    String article = ArticleExtractor.INSTANCE.getText(url);

(second attempt, from the HTML source as a string)

 String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);

+6

java html text-extraction

pedro_silva Feb 13 '12 at 11:51


6 answers

You do not need to modify boilerpipe's internal classes.

Just pass an InputSource object to the ArticleExtractor.INSTANCE.getText() method and set the encoding on that object explicitly. For instance:

    URL url = new URL("http://some-page-with-utf8-encodeing.tld");
    InputSource is = new InputSource();
    is.setEncoding("UTF-8");              // force the encoding used to decode the stream
    is.setByteStream(url.openStream());
    String text = ArticleExtractor.INSTANCE.getText(is);

Hello!

+2

cnr .. Jun 05 '12 at 12:31


Well, from what I can see, when you use it like this the library chooses the encoding automatically: it takes the charset from the Content-Type header and falls back to Cp1252 if none is given or the name is not supported. From the HTMLFetcher source:

    public static HTMLDocument fetch(final URL url) throws IOException {
        final URLConnection conn = url.openConnection();
        final String ct = conn.getContentType();
        Charset cs = Charset.forName("Cp1252");
        if (ct != null) {
            Matcher m = PAT_CHARSET.matcher(ct);
            if (m.find()) {
                final String charset = m.group(1);
                try {
                    cs = Charset.forName(charset);
                } catch (UnsupportedCharsetException e) {
                    // keep default
                }
            }
        }
        // (excerpt; the rest of the method is not shown)

Try debugging your code a bit, starting at ArticleExtractor.getText(URL), and see whether you can override the encoding.
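To see that fallback behaviour in isolation, here is a minimal sketch of the same idea outside boilerpipe. The class name, the method name, and the regular expression are my own assumptions for illustration (the pattern is not boilerpipe's PAT_CHARSET), and I use ISO-8859-1 as the fallback in the demo only because it is guaranteed to exist on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeCharset {
    // Hypothetical pattern written for this sketch, not copied from boilerpipe.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type header value,
    // or the given fallback if none is present or the name is unusable.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback; // keep the default, as the library excerpt does
        }
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(fromContentType("text/html; charset=UTF-8", fallback));
        System.out.println(fromContentType("text/html", fallback));
    }
}
```

If the server sends no charset parameter, everything is decoded with the fallback, which would explain broken accented characters on pages that are actually UTF-8.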

+1

Shivan dragon Feb 13 '12 at 12:07


Boilerpipe's ArticleExtractor uses some algorithms that are tuned specifically for English, such as measuring the average number of words per phrase. In any language that is more or less verbose than English (i.e., every other language), these algorithms will be less accurate.

In addition, the library uses some English phrases to try to find the end of the article (comments, commentary, your opinion, etc.), which obviously will not work in other languages.

This does not mean that the library will fail completely. Just be aware that some changes are probably necessary to get good results in non-English languages.

+1

Luke Feb 07 '14 at 14:37


Java:

    import java.net.URL;

    import org.xml.sax.InputSource;

    import de.l3s.boilerpipe.extractors.ArticleExtractor;

    public class Boilerpipe {

        public static void main(String[] args) {
            try {
                URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");
                InputSource is = new InputSource();
                is.setEncoding("UTF-8");              // decode the page as UTF-8
                is.setByteStream(url.openStream());
                String text = ArticleExtractor.INSTANCE.getText(is);
                System.out.println(text);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Eclipse: Run > Run Configurations > Common tab. Set the encoding to Other (UTF-8) so the console prints the text correctly, then click Run.

+1

Chris Jul 27 '14 at 19:25


I had the same problem; cnr's solution works fine. Just change the encoding from UTF-8 to ISO-8859-1. Thanks.

    URL url = new URL("http://some-page-with-utf8-encodeing.tld");
    InputSource is = new InputSource();
    is.setEncoding("ISO-8859-1");         // the page in my case was Latin-1, not UTF-8
    is.setByteStream(url.openStream());
    String text = ArticleExtractor.INSTANCE.getText(is);
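Which encoding to force depends on what the page actually sends. As a rough illustration of what goes wrong when the guess is off, here is a small standalone sketch (the class and method names are made up for this example, and it does not touch boilerpipe at all) that decodes the bytes of "história" with the wrong charset in both directions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Decodes raw bytes with the given charset, as an HTML fetcher would.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String original = "hist\u00f3ria"; // "história", escaped to avoid source-encoding issues

        // A UTF-8 page read as ISO-8859-1: the two bytes of "ó" become two characters.
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(utf8Bytes, StandardCharsets.ISO_8859_1));

        // An ISO-8859-1 page read as UTF-8: the single byte of "ó" is not valid UTF-8
        // and is replaced by U+FFFD.
        byte[] latin1Bytes = original.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(latin1Bytes, StandardCharsets.UTF_8));

        // Decoding with the matching charset round-trips cleanly.
        System.out.println(decode(utf8Bytes, StandardCharsets.UTF_8).equals(original));
    }
}
```

So if the extracted text shows pairs like "Ã³", the page is probably UTF-8 being read as a single-byte charset; if it shows replacement characters, it is probably the other way round.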

0

crowler Jun 2 '13 at 18:09


pedro_silva · Accepted Answer · 2012-03-06T15:31:56+0000

Well, here is the solution. As Andrei said, I had to change the HTMLFetcher class, which is in the package de.l3s.boilerpipe.sax. What I did was convert all the fetched text to UTF-8. At the end of the fetch function, I had to add two lines and change the last one:

    final byte[] data = bos.toByteArray();                        // stays the same
    byte[] utf8 = new String(data, cs.name()).getBytes("UTF-8");  // new line (conversion; name() is the canonical charset name)
    cs = Charset.forName("UTF-8");                                // new line: set the charset to UTF-8
    return new HTMLDocument(utf8, cs);                            // edited line
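For anyone who wants to try the conversion step without patching the library, here is a minimal standalone sketch of the same transcoding idea. The class and method names are hypothetical, and it assumes you already know (or have detected) the source charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf8Transcoder {
    // Decodes the bytes with the source charset, then re-encodes them as UTF-8.
    static byte[] toUtf8(byte[] data, Charset source) {
        return new String(data, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate a page fetched as ISO-8859-1 bytes ("história", escaped).
        byte[] latin1 = "hist\u00f3ria".getBytes(StandardCharsets.ISO_8859_1);

        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);

        // The accented character now takes two bytes, so the array grows by one.
        System.out.println(latin1.length + " -> " + utf8.length);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

Note that this only helps when the source charset is right: transcoding bytes that were decoded with the wrong charset just preserves the damage in UTF-8.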
