How to efficiently replace characters in an XML document in Java?

I am looking for a neat and efficient way to replace characters in an XML document. There is a mapping table defined for almost 12,000 UTF-8 characters; most of them should be replaced by a single character, but some should be replaced by two or even three characters (for example, the Greek theta should become TH). The documents can be large (100 MB+). How can this be done in Java? I came up with the idea of using XSLT, but I'm not sure it is the best option.

+3
2 answers

String.replace(..) is very slow in my experience. I used to process 100 MB KML files using this API, and the performance was poor. After I precompiled the regex with Pattern.compile(..), it ran much faster.
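As a rough sketch of the precompiled-pattern idea: the pattern is compiled once and each match is looked up in a table, so the input is scanned a single time no matter how many entries the table has. The class name RegexReplacer, the three-entry TABLE, and the choice to match every non-ASCII character are assumptions made for illustration; the real table would hold the roughly 12,000 entries from the question. It needs Java 9 or later.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexReplacer {
    // Hypothetical excerpt of the mapping table (assumption, not the real data).
    private static final Map<String, String> TABLE = Map.of(
            "\u03B8", "th",   // Greek small theta
            "\u0398", "TH",   // Greek capital theta
            "\u00E9", "e");   // e with acute accent

    // Compiled once and reused; matches any single code point outside ASCII.
    private static final Pattern NON_ASCII = Pattern.compile("[^\\x00-\\x7F]");

    static String replaceAll(String input) {
        Matcher m = NON_ASCII.matcher(input);
        StringBuilder out = new StringBuilder(input.length());
        while (m.find()) {
            // Characters missing from the table are copied through unchanged.
            String replacement = TABLE.getOrDefault(m.group(), m.group());
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

This works on a String that is already in memory, so for a 100 MB document it would have to be applied chunk by chunk (for example, per text node) rather than to the whole file at once.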

+3

Take a look at SAX, which allows you to see every single part of an XML document as you go through it. Then you can act on the text nodes and perform the necessary manipulations.
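A minimal sketch of the SAX approach, assuming only text nodes need rewriting: an XMLFilterImpl sits between the parser and a serializer, rewrites each characters() chunk, and passes every other event through, so the document is never held in memory as a whole. The class name SaxCharReplacer, the two-entry TABLE, and the use of the JDK's identity TransformerHandler as the serializer are choices made for this illustration.

```java
import java.io.Reader;
import java.io.Writer;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

final class SaxCharReplacer extends XMLFilterImpl {
    // Hypothetical excerpt of the mapping table, keyed by code point (assumption).
    private static final Map<Integer, String> TABLE = Map.of(
            0x03B8, "th",   // Greek small theta
            0x0398, "TH");  // Greek capital theta

    SaxCharReplacer(XMLReader parent) {
        super(parent);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Works char by char, so it is unaffected by how the parser splits text
        // into chunks. Only BMP characters are handled in this sketch.
        StringBuilder sb = new StringBuilder(length);
        for (int i = start; i < start + length; i++) {
            String replacement = TABLE.get((int) ch[i]);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.append(ch[i]);
            }
        }
        char[] out = sb.toString().toCharArray();
        super.characters(out, 0, out.length);
    }

    static void transform(Reader in, Writer out) throws Exception {
        SAXParserFactory parserFactory = SAXParserFactory.newInstance();
        parserFactory.setNamespaceAware(true);
        XMLReader reader = parserFactory.newSAXParser().getXMLReader();

        // An identity TransformerHandler turns SAX events back into XML text.
        SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler serializer = tf.newTransformerHandler();
        serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        serializer.setResult(new StreamResult(out));

        SaxCharReplacer filter = new SaxCharReplacer(reader);
        filter.setContentHandler(serializer);
        filter.parse(new InputSource(in));
    }
}
```

For real files, transform could be called with a buffered file reader and writer instead of in-memory strings. Attribute values, comments, and CDATA section boundaries are not handled here; covering them would mean also overriding startElement and wiring up a LexicalHandler.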

The problem with XSLT is that most implementations need to build the entire input tree in memory, which typically takes about 10 times the size of the file on disk. The only XSLT processor I know of that supports streaming is the commercial edition of Saxon (which would be ideal for your needs).

0

Source: https://habr.com/ru/post/1746181/

