ReadHTMLTable and UTF-8 Encoding

I have an encoding problem with readHTMLTable, and with the XML package in general. I would like to download some tables from the Polish site allegro.pl (an auction site similar to eBay), but the Polish characters come out garbled, even when I pass encoding = "UTF-8" or stringsAsFactors = F to readHTMLTable.

The code:

    library(XML)
    url <- paste("http://allegro.pl/listing.php/search?category=15821&sg=0&p=", 1:5, "&string=facebook", sep = "")
    alldata <- NULL
    for (i in 1:5) {
      dane <- as.data.frame(readHTMLTable(url[i], 1, stringsAsFactors = TRUE, encoding = "UTF-8")$lista)
      alldata <- rbind(alldata, dane)
    }

Result:

    > head(alldata[,c(2,3)])
      V2 V3
    1 Facebook Fan Page z ANIMACJĄ indywidualny projekt Kup Teraz! 150,00 zł
    2 Lubię to! Facebook! OKAZJA!!! 160 FANĂÂ"W!!! ZOBACZ! Kup Teraz! 10,99 zł
    3 125 fanĂÂłw fani like fanpage FACEBOOK polskie konta Kup Teraz! 10,00 zł
    4 Reklama Fanpage 43500+ fanĂÂłw, fani, facebook Efekt Kup Teraz! 17,99 zł
    5 Facebook Fanpage -Stworzenie Profesjonalnego Konta Kup Teraz! 77,90 zł
    6 Facebook Fanpage -Skuteczna Obsługa/Reklama /FV Kup Teraz! 100,00 zł

If I use getURL or readLines, there is no problem, but I want to use the XML package to make it cool :)

The problem occurs whenever I use XML package functions such as htmlParse, xpathApply, or the readHTMLTable mentioned above.

I am working in RStudio 0.94.110 on Windows 7. The sessionInfo() output is below.

    R version 2.14.0 (2011-10-31)
    Platform: x86_64-pc-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250
        LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C
        LC_TIME=Polish_Poland.1250

    attached base packages:
    [1] splines stats graphics grDevices utils datasets methods base

    other attached packages:
     [1] spdep_0.5-41 coda_0.14-6 deldir_0.0-16 maptools_0.8-10 foreign_0.8-46 nlme_3.1-102 Matrix_1.0-1 lattice_0.20-0 boot_1.3-3
    [10] sp_0.9-91 maps_2.2-2 RCurl_1.7-0.1 bitops_1.0-4.1 XML_3.4-2.2 Cairo_1.5-1 car_2.0-11 survival_2.36-10 nnet_7.3-1
    [19] MASS_7.3-16

    loaded via a namespace (and not attached):
    [1] grid_2.14.0 tools_2.14.0
1 answer

I exchanged emails for some time with Duncan Temple Lang, the creator of the XML package. Yesterday (January 30, 2012), he uploaded a new version of the XML package to the Omegahat website. The new version, 3.9-4, for the 32-bit version of R, fixes this encoding problem! :)

Download it from the link below: http://www.omegahat.org/R/bin/windows/contrib/2.14/
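To see whether a given installation predates the fix, the version numbers can be compared in base R. This is only a rough sketch: `needs_upgrade` is a hypothetical helper name, not something the XML package provides, and it assumes that 3.9-4 is the first version with the fix.

```r
# Version with the fix, as stated in this answer
fixed_version <- package_version("3.9-4")

# Hypothetical helper (not part of the XML package): TRUE when a version
# is older than the fixed release
needs_upgrade <- function(installed) {
  package_version(as.character(installed)) < fixed_version
}

# XML version listed in the question's sessionInfo() output
print(needs_upgrade("3.4-2.2"))

# Check the locally installed copy, if there is one
if (requireNamespace("XML", quietly = TRUE)) {
  print(needs_upgrade(utils::packageVersion("XML")))
}
```

The `as.character()` call lets the helper accept both a plain string and the object returned by `packageVersion()`.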

    library(XML)
    url <- paste("http://allegro.pl/listing.php/search?category=15821&sg=0&p=", 1:5, "&string=facebook", sep = "")
    doc <- htmlParse(url[1], encoding = "UTF-8")
    z <- as.data.frame(readHTMLTable(doc, stringsAsFactors = FALSE)$lista)
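The listing URL may no longer return the same page, so the same two calls can also be tried on an inline HTML string. In this sketch the HTML snippet is made up for illustration (only the table id `lista` and the sample words come from the question), and `header = FALSE` is set because the made-up table has no header row.

```r
library(XML)

# Made-up stand-in for a downloaded listing page; the Polish letters are
# written as \u escapes so the script itself stays ASCII
html <- paste0(
  "<html><head><meta charset='UTF-8'></head><body>",
  "<table id='lista'>",
  "<tr><td>125 fan\u00f3w</td><td>10,00 z\u0142</td></tr>",
  "<tr><td>ANIMACJ\u0104</td><td>150,00 z\u0142</td></tr>",
  "</table></body></html>"
)

# Parse the string (asText = TRUE) and state the encoding explicitly
doc <- htmlParse(html, asText = TRUE, encoding = "UTF-8")

# Extract the table by its id, keeping the text as character columns
tab <- readHTMLTable(doc, header = FALSE, stringsAsFactors = FALSE)$lista
print(tab)
```

If the encoding is handled correctly, the cells should compare equal to the original strings, for example `tab[1, 1] == "125 fan\u00f3w"`.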

This works, so we can close this topic. :)


Source: https://habr.com/ru/post/906610/
