My current project is about natural language analysis. One test reads text from a file, deletes certain characters, and tokenizes the text into separate words. The test then compares the number of unique words against an expected count. In Eclipse this test is green; in Maven I get more words than expected. Comparing the word lists, I see the following additional words:
- acquirer’s
- card’s
- institution’s
- issuer’s
- provider’s
- psam’s
- “from”
- “slot”
- “to”
Looking at the source text, it contains the following typographic quote characters, which should be filtered out: ’ “ ”
This works in Eclipse, but not in Maven. I am using UTF-8. The files seem to be correctly encoded, and in the Maven POM I specify the following:
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <org.apache.lucene.version>3.6.0</org.apache.lucene.version>
</properties>
Edit: Here is the code that reads the file (which, according to Eclipse, is encoded as UTF-8):
BufferedReader reader = new BufferedReader(
        new FileReader(this.file));
String line = "";
while ((line = reader.readLine()) != null) {
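For comparison, here is a minimal sketch of the same read loop with the charset stated explicitly. It rests on one fact and one assumption: `FileReader` (before Java 11) has no charset parameter and decodes with the JVM's platform default encoding, and I am assuming the file really is UTF-8. The class name `Utf8LineReader` and the method `readLines` are hypothetical names used only for illustration; they are not part of my project.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class Utf8LineReader {

    // Reads all lines of the file, decoding the bytes as UTF-8 regardless of
    // the JVM's platform default encoding.
    static List<String> readLines(File file) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), Charset.forName("UTF-8")));
        try {
            List<String> lines = new ArrayList<String>();
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
            return lines;
        } finally {
            // Closing the outermost reader also closes the underlying stream.
            reader.close();
        }
    }
}
```

If the two environments start the JVM with different default encodings, a reader like this should return the same strings in both, which would at least confirm or rule out the decoding step as the cause.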
Edit: The following information may be useful for analyzing the problem:
mvn -v
Apache Maven 3.0.3 (r1075438; 2011-02-28 09:31:09-0800)
Maven home: /usr/share/maven
Java version: 1.6.0_33, vendor: Apple Inc.
Java home: /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Default locale: en_US, platform encoding: MacRoman
OS name: "mac os x", version: "10.6.8", arch: "x86_64", family: "mac"
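The `mvn -v` output reports a platform encoding of MacRoman, so one possible explanation (an assumption, not something I have confirmed) is that the Maven run decodes the UTF-8 file as MacRoman. The sketch below mimics that: it encodes a word as UTF-8 and decodes the bytes with a single-byte charset. `PlatformEncodingDemo` and `redecode` are hypothetical names. `x-MacRoman` is the JDK's name for that charset, but it is an extended charset and may be missing from some runtimes, so the sketch falls back to ISO-8859-1, which shows the same effect.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, for illustration only.
class PlatformEncodingDemo {

    // Encodes the text as UTF-8, then decodes those bytes with another
    // charset, mimicking a reader that assumes the wrong encoding.
    static String redecode(String text, Charset wrong) {
        byte[] utf8 = text.getBytes(Charset.forName("UTF-8"));
        return new String(utf8, wrong);
    }

    public static void main(String[] args) {
        // Assumption: x-MacRoman may not be installed; fall back to ISO-8859-1.
        Charset wrong = Charset.isSupported("x-MacRoman")
                ? Charset.forName("x-MacRoman")
                : Charset.forName("ISO-8859-1");

        String word = "card\u2019s"; // "card" + right single quotation mark + "s"
        String garbled = redecode(word, wrong);

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println(word.length() + " chars -> " + garbled.length() + " chars");
        System.out.println("still contains U+2019: " + (garbled.indexOf('\u2019') >= 0));
    }
}
```

U+2019 takes three bytes in UTF-8, and a single-byte charset turns each byte into its own character, so the quote becomes three unrelated characters. A filter that looks for U+2019 would then find nothing to delete, which would match the extra words listed above.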