How to store non-English strings in an Excel file (Python 3)?

I have a simple project that scrapes reviews from a travel site and stores them in an Excel file. Reviews can be in Spanish, Japanese or any other language, and some reviews contain special characters such as "❤❤".

I need to save all the data (special characters can be excluded if they cannot be written).

I can clean the data I want and print it to the console as-is (for example, Japanese text), but when I try to save it to a CSV file I get the error shown below.

I tried opening the file with UTF-8 encoding (as mentioned in the comments), but then the saved data shows up as strange, meaningless characters, and I could not find an answer to this problem. Any suggestions?
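For reference, the attempt looked roughly like this, a minimal sketch where only the open() call differs from the full script below (the sample row is made up):

# Minimal sketch of what I tried: passing an explicit encoding to open()
# so Python does not fall back to the Windows default (cp1252) when writing.
file = "TajMahalSpanish.csv"
f = open(file, "w", encoding="utf-8")
f.write("rating, title, review\n")
f.write("50,Increíble,El Taj Mahal es precioso ❤❤\n")  # made-up sample row
f.close()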

I am using Python 3.5.3.

My Python code:

from selenium import webdriver
from bs4 import BeautifulSoup
import time
import re

file = "TajMahalSpanish.csv"
f = open(file, "w")
headers = "rating, title, review\n"
f.write(headers)

pages = 119
pageNumber = 2

option = webdriver.ChromeOptions()
option.add_argument("--incognito")
browser = webdriver.Chrome(executable_path='C:\Program Files\JetBrains\PyCharm Community Edition 2017.1.5\chrome webdriver\chromedriver', chrome_options=option)
browser.get("https://www.tripadvisor.in/Attraction_Review-g297683-d317329-Reviews-Taj_Mahal-Agra_Agra_District_Uttar_Pradesh.html")
time.sleep(10)

# Click the review filter controls on the page (site-specific XPaths).
browser.find_element_by_xpath('//*[@id="taplc_location_review_filter_controls_0_form"]/div[4]/ul/li[5]/a').click()
time.sleep(5)
browser.find_element_by_xpath('//*[@id="BODY_BLOCK_JQUERY_REFLOW"]/span/div[1]/div/form/ul/li[2]/label').click()
time.sleep(5)

while (pages):
    html = browser.page_source
    soup = BeautifulSoup(html, "html.parser")
    containers = soup.find_all("div", {"class": "innerBubble"})
    showMore = soup.find("span", {"onclick": "widgetEvCall('handlers.clickExpand',event,this);"})
    if showMore:
        # Expand truncated reviews, then re-parse the page.
        browser.find_element_by_xpath("//span[@onclick=\"widgetEvCall('handlers.clickExpand',event,this);\"]").click()
        time.sleep(3)
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        containers = soup.find_all("div", {"class": "innerBubble"})
        showMore = False
    for container in containers:
        bubble = container.div.div.span["class"][1]
        title = container.div.find("div", {"class": "quote"}).a.span.text
        review = container.find("p", {"class": "partial_entry"}).text
        f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
        print(bubble)
        print(title)
        print(review)
    # Go to the next page of reviews.
    browser.find_element_by_xpath("//div[@class='ppr_rup ppr_priv_location_reviews_list']//div[@class='pageNumbers']/span[@data-page-number='" + str(pageNumber) + "']").click()
    time.sleep(5)
    pages -= 1
    pageNumber += 1

f.close()

I get the following error:

 Traceback (most recent call last):
   File "C:/Users/Akshit/Documents/pycharmProjects/spanish.py", line 45, in <module>
     f.write(bubble + "," + title.replace(",", "|").replace("\n", "...") + "," + review.replace(",", "|").replace("\n", "...") + "\n")
   File "C:\Users\Akshit\AppData\Local\Programs\Python\Python35\lib\encodings\cp1252.py", line 19, in encode
     return codecs.charmap_encode(input,self.errors,encoding_table)[0]
 UnicodeEncodeError: 'charmap' codec can't encode characters in position 10-18: character maps to <undefined>

 Process finished with exit code 1

UPDATE

I am still trying to find a solution to this problem. Ultimately I need to translate the Japanese reviews into English for my research anyway, so I could use one of the Google translation APIs to translate each string in the code itself before writing it to the CSV file.
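Something along these lines is what I have in mind, a minimal sketch assuming the google-cloud-translate package and its credentials are set up (the helper name and sample text are placeholders, not part of my script):

# Minimal sketch, not part of the working script: assumes the
# google-cloud-translate package is installed and
# GOOGLE_APPLICATION_CREDENTIALS points to a valid key file.
from google.cloud import translate_v2 as translate

client = translate.Client()

def to_english(text):
    # Translate a review in any language to English before writing it out.
    result = client.translate(text, target_language="en")
    return result["translatedText"]

print(to_english("タージマハルは素晴らしかった"))  # placeholder Japanese review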

1 answer

UPDATE

Found the solution in "Is it possible to force Excel to recognize CSV UTF-8 files automatically?", as suggested by @MaartenFabré in the comments.

Basically, from what I understood, the problem is that Excel does not read the CSV file as UTF-8 by default, so when I open the CSV file (written by Python) directly in Excel, all the data appears corrupted.

The solution is this (a code-only alternative is sketched after the list):

  • Save the data to a .txt file instead of a .csv from Python.
  • Open Excel.
  • Import external data from the .txt file (Text Import Wizard).
  • Choose "Delimited" as the file type and "65001 : Unicode (UTF-8)" as the file origin.
  • Choose "," (or whatever you used) as the delimiter and finish the import.
  • The data now displays correctly in Excel, in the right rows and columns, for every language: Japanese, Spanish, French, etc.
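Alternatively, based on the linked question, writing the CSV with a UTF-8 byte order mark (the "utf-8-sig" codec) lets Excel detect the encoding on its own, so the manual import is not needed. A minimal sketch using Python's built-in csv module (the sample row is made up):

# Minimal sketch: "utf-8-sig" prepends a UTF-8 BOM, which Excel uses to
# detect the encoding when the .csv file is opened directly.
import csv

with open("TajMahalSpanish.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["rating", "title", "review"])
    writer.writerow(["50", "Increíble", "El Taj Mahal es precioso ❤❤"])  # made-up row

Using csv.writer here also quotes commas and newlines automatically, so the replace(",", "|") workarounds in the script above would not be needed.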

Thanks again @MaartenFabre for the help!


Source: https://habr.com/ru/post/1270533/

