Character corruption comes from BufferedReader to BufferedWriter in java

Question

In Java, I am trying to parse an HTML file containing complex text such as Greek characters.

I run into a problem when the text contains a left quotation mark. For example, the text

mutations to particular "hotspot" regions

becomes

 mutations to particular "hotspot ? regions

I reproduced the problem with a simple test method:

public static int CopyFile()
{
    try
    {
        StringBuffer sb = null;
        String NullSpace = System.getProperty("line.separator");
        Writer output = new BufferedWriter(new FileWriter(outputFile));
        String line;
        BufferedReader input = new BufferedReader(new FileReader(myFile));
        while ((line = input.readLine()) != null)
        {
            sb = new StringBuffer();
            // Parsing would happen here
            sb.append(line);
            output.write(sb.toString() + NullSpace);
        }
        return 0;
    }
    catch (Exception e)
    {
        return 1;
    }
}

Can anyone advise how to fix this problem?

My solution

InputStream in = new FileInputStream(myFile);
Reader reader = new InputStreamReader(in, "utf-8");
Reader buffer = new BufferedReader(reader);
Writer output = new BufferedWriter(new FileWriter(outputFile));
int r;
// Read through the buffered reader so the buffering is actually used
while ((r = buffer.read()) != -1)
{
    if (r < 126)
    {
        output.write(r);
    }
    else
    {
        // Escape everything else as a numeric character reference
        output.write("&#" + Integer.toString(r) + ";");
    }
}
output.flush();
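
A note on the escaping approach above: Reader.read() returns UTF-16 code units, so a character outside the Basic Multilingual Plane would arrive as two surrogate values and be written as two invalid references. The following is a minimal sketch of the same idea working on code points instead. The class name HtmlEscaper, the method escapeNonAscii, and the threshold of 128 are assumptions made for illustration, not part of the original code.

```java
class HtmlEscaper {
    // Replaces every code point outside the ASCII range with a decimal
    // numeric character reference such as &#8220; (hypothetical helper).
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (cp < 128) {
                sb.append((char) cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
            // Advance by one or two chars depending on the code point
            i += Character.charCount(cp);
        }
        return sb.toString();
    }
}
```

Applied line by line inside the read loop, this keeps the output pure ASCII, so the encoding used by the writer no longer matters.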

java html-parsing special-characters bufferedreader bufferedwriter

Mikhail Aug 24 '10 at 17:50

3 answers

The likely culprit is FileReader. From its Javadoc:

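Convenience class for reading character files. The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.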

extraneon Aug 24 '10 at 18:00

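The Javadoc for FileReader says:

The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate. To specify these values yourself, construct an InputStreamReader on a FileInputStream.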

In your case, the default character encoding probably doesn't match the encoding of the input file. Find out what encoding the input file uses and specify it. For instance:

FileInputStream fis = new FileInputStream(myFile);
InputStreamReader isr = new InputStreamReader(fis, "charset name goes here");
BufferedReader input = new BufferedReader(isr);
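
Putting this together with the copy loop from the question, a complete version might look like the sketch below. It names the charset on both the reading and the writing side, so neither depends on the platform default. The class name EncodingAwareCopy and the method signature are assumptions for illustration. Which charsets to pass depends on the actual files.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

class EncodingAwareCopy {
    // Copies a text file line by line, decoding and encoding with
    // explicitly chosen charsets (hypothetical helper, not from the question).
    static void copy(File source, Charset sourceCharset,
                     File target, Charset targetCharset) throws IOException {
        try (BufferedReader input = new BufferedReader(
                 new InputStreamReader(new FileInputStream(source), sourceCharset));
             BufferedWriter output = new BufferedWriter(
                 new OutputStreamWriter(new FileOutputStream(target), targetCharset))) {
            String line;
            while ((line = input.readLine()) != null) {
                // Parsing would happen here
                output.write(line);
                output.newLine();
            }
        } // try-with-resources flushes and closes both streams
    }
}
```

A call such as EncodingAwareCopy.copy(myFile, StandardCharsets.UTF_8, outputFile, StandardCharsets.UTF_8) would then preserve the quotation marks, assuming the input really is UTF-8.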

Richard Fearn Aug 24 '10 at 18:00

Thierry-Dimitri Roy · Accepted Answer · 2010-08-24T17:54:02+0000

The file you are reading is probably not in the same encoding (probably UTF-8) as the file you are writing (possibly ISO-8859-1).

Try writing the output file as UTF-8:

BufferedWriter output = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile),"UTF8"));
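
To see where a question mark can come from, here is a small sketch of one plausible mechanism: when a character has no representation in the target charset, Java's encoder substitutes the charset's replacement byte, which for ISO-8859-1 is '?'. The class and method names are made up for the demonstration, and the exact charsets involved in the question are not known.

```java
import java.nio.charset.StandardCharsets;

class QuestionMarkDemo {
    // Encodes the text as ISO-8859-1 and decodes it again, returning
    // whatever survives the round trip.
    static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+201C and U+201D are the curly quotes; ISO-8859-1 has neither.
        System.out.println(roundTripLatin1("\u201Chotspot\u201D regions"));
        // prints: ?hotspot? regions
    }
}
```

Writing with a charset that can represent the characters, such as UTF-8, avoids the substitution.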
