Why does Maven give me different UTF-8 characters than Eclipse (test passes in Eclipse, fails in Maven)?

My current project is about natural language analysis. One test reads text from a file, deletes certain characters and tokenizes the text into separate words. The test then compares the number of unique words. In Eclipse this test is green; in Maven I get more words than expected. When comparing the word lists, I see the following additional words:

  • acquirer⊙s
  • card⊙s
  • institution⊙s
  • issuer⊙s
  • provider⊙s
  • psam⊙s
  • ⊜from⊝
  • ⊜slot⊝
  • ⊜to⊝

Looking at the source of the text, it contains the following characters that should be filtered out: ""

This works in Eclipse, but not in Maven. I am using UTF-8, and the files seem to be correctly encoded. In the Maven POM I specify the following:

    <properties>
      <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
      <org.apache.lucene.version>3.6.0</org.apache.lucene.version>
    </properties>

Edit: Here is the code that reads the file (which, according to Eclipse, is encoded as UTF-8).

    BufferedReader reader = new BufferedReader(
            new FileReader(this.file));
    String line = "";
    while ((line = reader.readLine()) != null) {
        // the csv contains a text and a classification
        String[] reqCatType = line.split(";");
        String reqText = reqCatType[0].trim();
        String reqCategory = reqCatType[1].trim();
        // the tokenizer also removes unwanted characters:
        String[] sentence = this.filter.filterStopWords(
                this.tokenizer.tokenize(reqText));
        // we use this data to train a machine learning algorithm
        this.dataSet.learn(sentence, reqCategory);
    }
    reader.close();

Edit: The following information may be useful for analyzing the problem:

    mvn -v
    Apache Maven 3.0.3 (r1075438; 2011-02-28 09:31:09-0800)
    Maven home: /usr/share/maven
    Java version: 1.6.0_33, vendor: Apple Inc.
    Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
    Default locale: en_US, platform encoding: MacRoman
    OS name: "mac os x", version: "10.6.8", arch: "x86_64", family: "mac"
1 answer

So your data file is in UTF-8. The Eclipse encoding setting for that file is irrelevant, because it is the Java runtime, not the IDE, that decodes the bytes when your code reads the file.

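FileReader always uses the platform default encoding, which is usually a bad idea. Eclipse most likely sets the default encoding for the JVM it launches, while Maven does not, so under Maven the file is decoded with the platform encoding (MacRoman, according to your mvn -v output).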
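To see why a wrong default produces extra "words", here is a minimal sketch that decodes the same UTF-8 bytes twice. The class name EncodingMismatchDemo and the sample word are made up for illustration, and ISO-8859-1 stands in for MacRoman only because every JVM is guaranteed to ship it; the exact garbage characters differ per charset, but the effect is the same.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the asker's project.
public class EncodingMismatchDemo {

    // Decode the given bytes with a caller-chosen charset.
    public static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "card's" with a typographic apostrophe (U+2019), written as a
        // Unicode escape so this source file itself stays pure ASCII.
        byte[] utf8Bytes = "card\u2019s".getBytes(Charset.forName("UTF-8"));

        String right = decode(utf8Bytes, "UTF-8");
        // Stand-in for a mismatched platform default such as MacRoman.
        String wrong = decode(utf8Bytes, "ISO-8859-1");

        System.out.println("platform default:  " + Charset.defaultCharset());
        System.out.println("UTF-8 length:      " + right.length());
        System.out.println("ISO-8859-1 length: " + wrong.length());
        // A filter that looks for U+2019 cannot match the mis-decoded text.
        System.out.println("U+2019 still present after wrong decode: "
                + (wrong.indexOf('\u2019') >= 0));
    }
}
```

With the wrong charset, the three UTF-8 bytes of the apostrophe become three unrelated characters, so a filter that removes U+2019 no longer matches and the token survives as a new word.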

Fix your code to specify the encoding explicitly.

See JavaDoc:

 To specify these values yourself, construct an InputStreamReader on a FileInputStream. 
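A minimal sketch of what the JavaDoc suggests, assuming the file really is UTF-8. Utf8LineReader and readLines are hypothetical names used only for this example; in the question's code the same change amounts to replacing new FileReader(this.file) with the InputStreamReader construction shown here. The code sticks to APIs available on Java 6, which is the runtime shown in the mvn -v output.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the asker's project.
public class Utf8LineReader {

    // Charset.forName works on Java 6; StandardCharsets needs Java 7+.
    private static final Charset UTF_8 = Charset.forName("UTF-8");

    // Reads all lines of the file, always decoding as UTF-8
    // regardless of the platform default encoding.
    public static List<String> readLines(File file) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), UTF_8));
        try {
            List<String> lines = new ArrayList<String>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } finally {
            // Close the reader even if reading fails.
            reader.close();
        }
    }
}
```

Because the charset is now part of the code, the test should behave the same in Eclipse, in Maven, and on any machine, whatever the platform encoding happens to be.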

Source: https://habr.com/ru/post/1432384/