I would like to scrape the following webpage:
http://www.oricon.co.jp/rank/js/w/2017-01-16/p/4/
But there are some encoding issues:
library(rvest)
URL = 'http://www.oricon.co.jp/rank/js/w/2017-01-16/p/4/'
read_html(URL)
Error in eval(substitute(expr), envir, enclos): input conversion failed due to input error, bytes 0xFA 0xB1 0x90 0xE7 [6003]
The page is clearly written in Japanese; the first three pages don't have this encoding issue, for example:
read_html('http://www.oricon.co.jp/rank/js/w/2017-01-16/p/2/')
# {xml_document}
# <html>
# [1] <head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">\n <meta charset="shi ...
# [2] <body id="container"> \n<script src="//ajax.googleapis.com/ajax/libs/jquery/1.11 ...
This page behaves as expected, and I can easily get the content I want.
I tried to be more explicit about the encoding, using the approach Hadley proposed in the rvest issue tracker here:
library(httr)
guess_encoding(content(GET(URL), 'raw'))
None of the guesses is right. The language is wrong for every candidate except Shift_JIS and EUC-JP, but both of those produce similar errors (only the unrecognized byte codes differ):
read_html(URL, encoding = 'Shift_JIS')
Error in eval(substitute(expr), envir, enclos): input conversion failed due to input error, bytes 0xFA 0xB1 0x90 0xE7 [6003]
read_html(URL, encoding = 'EUC-JP')
Error in eval(substitute(expr), envir, enclos): input conversion failed due to input error, bytes 0x8F 0x54 0x8A 0xD4 [6003]
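One way to probe this offline (a sketch, assuming the bytes reported in the error really are the offending sequence) is to feed them to `iconv()` directly. On a glibc `iconv`, strict Shift-JIS has no rows with lead byte `0xFA`, while CP932 (the Windows superset, a.k.a. Windows-31J) does:

```r
# Bytes reported by the Shift_JIS error above; 0xFA is a CP932-only lead byte
b <- as.raw(c(0xfa, 0xb1, 0x90, 0xe7))

# Strict Shift-JIS (JIS X 0208) rejects the 0xFA.. rows -> conversion fails (NA)
iconv(list(b), from = "SHIFT-JIS", to = "UTF-8")

# CP932 extends Shift-JIS with those rows -> the bytes decode
iconv(list(b), from = "CP932", to = "UTF-8")
```

If that holds for this page, passing `encoding = "CP932"` to `read_html` might be worth trying, assuming the underlying libxml2 build recognizes that encoding name.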
Forcing another encoding, such as ISO-8859-1, parses without error but returns gobbledygook:
cell.xp = '//div[@class="inner" and ./a[contains(@href, "prof")]]'
read_html(URL, encoding = 'ISO-8859-1') %>%
html_nodes(xpath = cell.xp) %>% html_nodes('h2') %>%
html_text %>% tail(6)
# [1] "\u0082æ\u0082Ñ\u0082·\u0082Ä"
# [2] "\u0083n\u0083b\u0083s\u0081[\u0083G\u0083\u0093\u0083h"
# [3] "THE IDOLM@STER CINDERELLA GIRLS STARLIGHT MASTER 07 \u0083T\u0083}\u0083J\u0083j!!"
# [4] "Dear Bride"
# [5] "Fantastic Time"
# [6] "Hey Ho"
These should be Japanese titles, not Latin-script gibberish.
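The pattern in that output is recognizable: Shift-JIS byte pairs decoded as Latin-1. As a sanity check (a sketch using base `iconv()`, not a fix for the parsing error itself), you can round-trip the first garbled string: re-encode the mojibake to Latin-1 to recover the original bytes, then decode those bytes as CP932:

```r
# First garbled title from the output above ("\u0082æ\u0082Ñ\u0082·\u0082Ä")
x <- "\u0082\u00e6\u0082\u00d1\u0082\u00b7\u0082\u00c4"

# Re-encoding to Latin-1 recovers the raw bytes: 0x82 0xE6 0x82 0xD1 0x82 0xB7 0x82 0xC4
raw_bytes <- iconv(x, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]]

# Decoding those bytes as CP932 yields the intended Japanese text
iconv(list(raw_bytes), from = "CP932", to = "UTF-8")
# [1] "よびすて"
```

So the text itself is intact Shift-JIS/CP932; it is only the conversion step inside `read_html` that is choking.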
So, why is guess_encoding wrong here? What is different about this page compared to the first three, which parse fine? And what does the 6003 in the error message refer to?
My sessionInfo():