Translating text encoding in Clojure

Question

I would like to write a Clojure function that takes a string in one encoding and converts it to another, the way the iconv library does.

For example, let's look at the "è" character. In ISO-8859-1 ( http://www.ascii-code.com/ ) it is e8 in hex. In UTF-8 ( http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%C3%A8&mode=char ) it is c3 a8.

So, let's say we have a file iso.txt that contains our letter and an EOL:

 $ hexdump iso.txt
 0000000 e8 0a
 0000002

Now we can convert it to UTF-8 as follows:

 $ iconv -f ISO-8859-1 -t UTF-8 iso.txt | hexdump
 0000000 c3 a8 0a
 0000003

How do I write something equivalent in Clojure? I am happy to use external libraries, but I don't know where to look for them. Looking around, I couldn't figure out how to use libiconv on the JVM, but is there perhaps an alternative?

Edit

After reading the link Alex posted in the comments, it turns out to be so simple and so cool:

 user> (new String (byte-array 2 (map unchecked-byte [0xc3 0xa8])) "UTF-8")
 "è"
 user> (new String (byte-array 1 [(unchecked-byte 0xe8)]) "ISO-8859-1")
 "è"
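The two expressions above only decode bytes into a string. To get the iconv-style round trip, the decoded string can be re-encoded with `String.getBytes`. The following is a minimal sketch, not a definitive implementation: `transcode` and `iso-bytes` are hypothetical names chosen for illustration, and the charset names are assumed to be ones the JVM supports.

```clojure
;; Hypothetical helper: decode the bytes using charset `from`,
;; then re-encode the resulting string using charset `to`.
(defn transcode
  [^bytes bs ^String from ^String to]
  (.getBytes (String. bs from) to))

;; The contents of iso.txt from the question: "è" in ISO-8859-1 plus a newline.
(def iso-bytes
  (byte-array [(unchecked-byte 0xe8) (unchecked-byte 0x0a)]))

;; Show the converted bytes as hex, masking each byte to its unsigned value.
(map #(format "%02x" (bit-and % 0xff))
     (transcode iso-bytes "ISO-8859-1" "UTF-8"))
;; => ("c3" "a8" "0a")
```

The result matches the hexdump of the iconv output shown in the question.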

spike Sep 13 '13 at 19:38

1 answer

Jared314 · Accepted Answer · 2013-09-13T20:26:22+0000

If you just need to convert a whole file to UTF-8, slurp lets you specify the file's encoding with the :encoding option, and spit writes UTF-8 by default. This approach reads the entire file into memory, so a different approach may be needed for large files.

 $ printf "\xe8\n" > iso.txt
 $ hexdump iso.txt
 0000000 e8 0a
 0000002

 (spit "/Users/path/iso2.txt" (slurp "/Users/path/iso.txt" :encoding "ISO-8859-1"))

 $ hexdump iso2.txt
 0000000 c3 a8 0a
 0000003

Note: slurp reads the file as UTF-8 if you do not specify an encoding.
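For files too large to read into memory, one possible alternative is to stream the data through a reader and a writer from clojure.java.io, each opened with its own :encoding. This is only a sketch under that assumption; `convert-file` is a hypothetical name and is not part of the original answer.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: stream `in` to `out`, decoding with charset `from`
;; and encoding with charset `to`, without holding the whole file in memory.
(defn convert-file
  [in out from to]
  (with-open [r (io/reader in :encoding from)
              w (io/writer out :encoding to)]
    (io/copy r w)))

;; Example call, reusing the paths from the answer above:
;; (convert-file "/Users/path/iso.txt" "/Users/path/iso2.txt" "ISO-8859-1" "UTF-8")
```

io/copy moves the characters across in buffered chunks, so memory use should stay roughly constant regardless of file size.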
