I have a simple project that scrapes reviews from a travel site and stores them in an Excel-readable CSV file. Reviews can be in Spanish, Japanese, or any other language, and some of them contain special characters such as "❤❤".
I need to save all of the data (special characters can be dropped if they cannot be written).
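If dropping unwritable characters really is acceptable, one possible sketch is to round-trip each string through the target encoding with `errors="ignore"`. The helper name `drop_unencodable` and the sample string are illustrative assumptions, not part of the original script. Note that with `cp1252` this removes Japanese text entirely, so it only suits the emoji-style characters.

```python
def drop_unencodable(text, encoding="cp1252"):
    """Return `text` without the characters that `encoding` cannot represent."""
    # errors="ignore" silently skips every character the codec cannot encode.
    return text.encode(encoding, errors="ignore").decode(encoding)


# Hypothetical sample: Spanish survives in cp1252, the hearts do not.
sample = "España ❤❤"
cleaned = drop_unencodable(sample)
print(repr(cleaned))
```

Passing `errors="ignore"` to `open()` has the same effect at write time, but keeping the text and switching the file to UTF-8 loses nothing.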
I can extract the data I want and print it to the console as-is (for example, Japanese text), but when I try to save it to a CSV file I get the error shown below.
I tried opening the file with utf-8 encoding (as mentioned in the comment below), but then the data is saved as strange characters that make no sense, and I could not find an answer to this problem. Any suggestions?
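One likely cause of the "strange characters", assuming the file is being viewed in Excel (the post does not say which program shows them), is that Excel reads a plain UTF-8 CSV as a legacy code page unless the file starts with a byte order mark. A minimal sketch using the `utf-8-sig` codec, which writes that mark, follows. The sample row and the temporary output path are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical sample row standing in for one scraped review.
rows = [("bubble_50", "素晴らしい", "Un lugar precioso, ❤❤")]

# Hypothetical output path; the original script writes "TajMahalSpanish.csv".
path = os.path.join(tempfile.gettempdir(), "reviews_sample.csv")

# "utf-8-sig" prepends a BOM so that Excel can detect the encoding.
# newline="" lets the csv module control line endings itself.
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["rating", "title", "review"])
    writer.writerows(rows)

# Read the raw bytes back to confirm the BOM is present.
with open(path, "rb") as f:
    raw = f.read()
print(raw[:3])
```

Because `csv.writer` quotes fields that contain commas or newlines, the manual `replace(",", "|")` workaround would no longer be needed with this approach.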
I am using Python 3.5.3.
My Python code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
import re

file = "TajMahalSpanish.csv"
f = open(file, "w")
headers = "rating, title, review\n"
f.write(headers)

pages = 119
pageNumber = 2

option = webdriver.ChromeOptions()
option.add_argument("--incognito")
browser = webdriver.Chrome(executable_path='C:\Program Files\JetBrains\PyCharm Community Edition 2017.1.5\chrome webdriver\chromedriver', chrome_options=option)
browser.get("https://www.tripadvisor.in/Attraction_Review-g297683-d317329-Reviews-Taj_Mahal-Agra_Agra_District_Uttar_Pradesh.html")
time.sleep(10)
browser.find_element_by_xpath('//*[@id="taplc_location_review_filter_controls_0_form"]/div[4]/ul/li[5]/a').click()
time.sleep(5)
browser.find_element_by_xpath('//*[@id="BODY_BLOCK_JQUERY_REFLOW"]/span/div[1]/div/form/ul/li[2]/label').click()
time.sleep(5)

while (pages):
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.find_all("div",{"class":"innerBubble"})
    showMore = soup.find("span", {"onclick": "widgetEvCall('handlers.clickExpand',event,this);"})
    if showMore:
        browser.find_element_by_xpath("//span[@onclick=\"widgetEvCall('handlers.clickExpand',event,this);\"]").click()
        time.sleep(3)
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        containers = soup.find_all("div", {"class": "innerBubble"})
        showMore = False
    for container in containers:
        bubble = container.div.div.span["class"][1]
        title = container.div.find("div", {"class": "quote"}).a.span.text
        review = container.find("p", {"class": "partial_entry"}).text
        f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
        print(bubble)
        print(title)
        print(review)
    browser.find_element_by_xpath("//div[@class='ppr_rup ppr_priv_location_reviews_list']//div[@class='pageNumbers']/span[@data-page-number='" + str(pageNumber) + "']").click()
    time.sleep(5)
    pages -= 1
    pageNumber += 1
f.close()
I get the following error:
Traceback (most recent call last):
  File "C:/Users/Akshit/Documents/pycharmProjects/spanish.py", line 45, in <module>
    f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
  File "C:\Users\Akshit\AppData\Local\Programs\Python\Python35\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 10-18: character maps to <undefined>

Process finished with exit code 1
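The `cp1252.py` frame in the traceback suggests that `open(file, "w")` fell back to the platform's default text encoding, which on this Windows machine is cp1252, and that codec has no mapping for Japanese characters or "❤". The sketch below reproduces the failure without Selenium. The review string is a made-up example, not data from the site.

```python
import locale

# With no explicit encoding, open() uses this value for text files.
print("default text encoding:", locale.getpreferredencoding(False))

# Hypothetical review text containing Japanese and a heart symbol.
review = "タージマハルは素晴らしい ❤❤"

# Encoding with cp1252 fails in the same way as the f.write() call above.
try:
    review.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError as exc:
    cp1252_ok = False
    print("cp1252 failed:", exc.reason)

# UTF-8 can represent every Unicode character, so this always succeeds.
utf8_bytes = review.encode("utf-8")
print(len(utf8_bytes), "bytes as UTF-8")
```

This is why passing an explicit `encoding` argument to `open()` makes the error go away, whatever the console is able to print.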
UPDATE
I am still trying to find a solution to this problem. In the end I need to translate the Japanese reviews into English for my research anyway, so I could use one of the Google APIs to translate each string in the code itself and then write the translated text to the CSV file.
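A rough sketch of that translate-then-write idea is shown below, with the translation step left as a pluggable function. `write_translated` and `fake_translate` are hypothetical names, and `fake_translate` is only a placeholder: no real translation API is called here, and the actual service and its client library would have to be chosen separately.

```python
import csv
import io


def write_translated(rows, translate, out):
    """Write (rating, title, review) rows to `out`, adding a translated column."""
    writer = csv.writer(out)
    writer.writerow(["rating", "title", "review", "review_en"])
    for rating, title, review in rows:
        # Keep the original text next to the translation so nothing is lost.
        writer.writerow([rating, title, review, translate(review)])


def fake_translate(text):
    # Hypothetical stand-in: a real implementation would call a translation service.
    return "[en] " + text


# An in-memory buffer stands in for a file opened with an explicit UTF-8 encoding.
buffer = io.StringIO()
write_translated([("bubble_50", "素晴らしい", "とても美しい")], fake_translate, buffer)
print(buffer.getvalue())
```

Writing only the English text would sidestep the encoding error for most rows, but keeping the original column still requires a UTF-8 output file.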