Well, it took a while, but I think I figured it out.
It turns out that the web page is incorrectly encoded. It claims to be "ISO-8859-1", but some pages contain a trademark character encoded as \x99, which means it probably really uses the "Windows-1252" code page. This byte is outside the normal ASCII range, so it seems to be read as part of a multibyte character, and the file quickly becomes corrupted.
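To see what that one byte does, here is a minimal sketch in base R. The three-byte sample ("TM" followed by 0x99) is made up for illustration and is not taken from the actual pages, and the encoding names are assumed to be ones your iconv build accepts.

```r
# Hypothetical sample: the ASCII letters "TM" followed by the single byte 0x99
bytes <- as.raw(c(0x54, 0x4d, 0x99))
x <- rawToChar(bytes)

# A lone 0x99 byte is not valid UTF-8
print(validUTF8(x))

# Decode the same bytes under both candidate encodings
as_cp1252 <- iconv(x, from = "WINDOWS-1252", to = "UTF-8")
as_latin1 <- iconv(x, from = "ISO-8859-1", to = "UTF-8")

# Compare the resulting Unicode code points
print(utf8ToInt(as_cp1252))
print(utf8ToInt(as_latin1))
```

If both encoding names are available, the last code point should be 8482 (U+2122, the trade mark sign) under Windows-1252 and 153 (a C1 control character) under ISO-8859-1, which is why the declared encoding cannot be right.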
As far as I can tell, RCurl does not support this encoding natively. But you can still download the file as binary data and then convert it with iconv, which has more conversion options. This should work:
library(RCurl)
# Download each page as raw bytes, then decode them as Windows-1252
raw <- lapply(links, getURLContent, binary=TRUE)
pages <- lapply(lapply(raw, readBin, "character"), iconv, from="WINDOWS-1252", to="UTF-8")
Note that I tested this on my Mac. The exact values for the from/to strings may vary by platform. If this does not work on your computer, check the list from iconvlist() for a possible replacement for from=.
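One way to look for a usable name is to filter that list. This is only a sketch: the search pattern "1252" is my guess at what the alias will contain, and the result depends on your platform's iconv.

```r
# List the encoding names this platform's iconv knows about
available <- iconvlist()

# Keep only the names that mention 1252 (e.g. possible aliases for Windows-1252)
candidates <- grep("1252", available, value = TRUE, ignore.case = TRUE)
print(candidates)
```

Any name printed here should be a reasonable candidate for from=. If nothing is printed, the list may simply be unavailable on your system.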