TypeError: sequence item 1: expected a bytes-like object, str found

I am trying to extract English names from a dump of wiki titles stored in a text file, using a regular expression in Python 3. The dump also contains names in other languages and some special characters. Below is my code:

import re

with open('/Users/some/directory/title.txt', 'rb') as f:
    text = f.read()
    letters_only = re.sub(b"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
print(words)

But I get an error message:

TypeError: sequence item 1: expected a bytes-like object, str found 

in line: letters_only = re.sub(b"[^a-zA-Z]", " ", text)

But I am already using the b'' prefix so that the pattern is bytes. Below is a sample of the text file:

Destroy-Oh-Boy!!
!!Que_Corra_La_Voz!!
!!_(chess)
!!_(disambiguation)
!'O!Kung
!'O!Kung_language
!'O-!khung_language
!337$P34K
!=
!?
!?!
!?Revolution!?
!?_(chess)
!A_Luchar!
!Action_Pact!
!Action_pact!
!Adios_Amigos!
!Alabadle!
!Alarma!
!Alarma!_(album)
!Alarma!_(disambiguation)
!Alarma!_(magazine)
!Alarma!_Records
!Alarma!_magazine
!Alfaro_Vive,_Carajo!
!All-Time_Quarterback!
!All-Time_Quarterback!_(EP)
!All-Time_Quarterback!_(album)
!Alla_tu!
!Amigos!
!Amigos!_(Arrested_Development_episode)
!Arriba!_La_Pachanga
!Ask_a_Mexican!
!Atame!
!Ay,_Carmela!_(film)
!Ay,_caramba!
!BANG!
!Bang!
!Bang!_TV
!Basta_Ya!
!Bastardos!
!Bastardos!_(album)
!Bastardos_en_Vivo!
!Bienvenido,_Mr._Marshall!
!Ciauetistico!
!Ciautistico!
!DOCTYPE
!Dame!_!Dame!_!Dame!
!Decapitacion!
!Dos!
!Explora!_Science_Center_and_Children's_Museum
!F
!Forward,_Russia!
!Forward_Russia!
!Ga!ne_language
!Ga!nge_language
!Gã!ne
!Gã!ne_language
!Gã!nge_language
!HERO
!Happy_Birthday_Guadaloupe!
!Happy_Birthday_Guadalupe!
!Hello_Friends

I searched the Internet but could not find a solution. Any help would be appreciated.

+4
3 answers

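You need to choose between binary mode and text mode.

Either open the file as rb and use re.sub(b"[^a-zA-Z]", b" ", text) (text is a bytes object),

or open the file as r and use re.sub("[^a-zA-Z]", " ", text) (text is a str object).

The second option is the more usual one when working with text.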
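
A minimal sketch of both options, run against a throwaway file that holds a few titles from the sample in the question. The temporary file, the variable names, and the explicit encoding="utf-8" are assumptions made for illustration; they are not part of the original question.

```python
import os
import re
import tempfile

# A few titles from the sample, written to a temporary file (illustrative only).
sample = "!Action_Pact!\n!Gã!ne_language\n!337$P34K\n"
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write(sample)
    path = tmp.name

# Option 1: binary mode. text is bytes, so the pattern and repl are bytes too.
with open(path, "rb") as f:
    text = f.read()
words_from_bytes = re.sub(b"[^a-zA-Z]", b" ", text).lower().split()

# Option 2: text mode. text is str, so the pattern and repl are str too.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()
words_from_str = re.sub(r"[^a-zA-Z]", " ", text).lower().split()

os.remove(path)

print(words_from_bytes)  # [b'action', b'pact', b'g', b'ne', b'language', b'p', b'k']
print(words_from_str)    # ['action', 'pact', 'g', 'ne', 'language', 'p', 'k']
```

Both options split the file into the same words; the only difference is whether each word is a bytes or a str object.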

+3

When the pattern and the input are bytes, the repl argument must be bytes as well:

letters_only = re.sub(b"[^a-zA-Z]", " ", b'Hello2World')
# TypeError: sequence item 1: expected a bytes-like object, str found

Change repl to b" ":

letters_only = re.sub(b"[^a-zA-Z]", b" ", b'Hello2World')
print(letters_only) 
b'Hello World'

In short: because the file is opened in rb mode, text is bytes, so both the pattern and repl need the b prefix.
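
The same rule can be sketched as a small helper that picks a pattern and replacement matching the type of its input. The name letters_only_words is hypothetical and used only for this illustration.

```python
import re


def letters_only_words(data):
    """Split data into lowercase ASCII-letter words; accepts bytes or str."""
    if isinstance(data, bytes):
        # bytes input: the pattern and repl must both be bytes
        pattern, repl = b"[^a-zA-Z]", b" "
    else:
        # str input: the pattern and repl must both be str
        pattern, repl = r"[^a-zA-Z]", " "
    return re.sub(pattern, repl, data).lower().split()


print(letters_only_words(b"Hello2World"))  # [b'hello', b'world']
print(letters_only_words("Hello2World"))   # ['hello', 'world']

# Mixing a bytes pattern with a str repl reproduces the error from the question.
try:
    re.sub(b"[^a-zA-Z]", " ", b"Hello2World")
except TypeError as exc:
    print("TypeError:", exc)
```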

+5

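You cannot use a byte string for the pattern when the replacement is an ordinary string.
Either everything is a byte string (pattern, replacement, and input), or everything is an ordinary string. In your code the pattern and the input are bytes but the replacement is a str, hence the error.

The following works (using str rather than bytes):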

import re

with open('/Users/some/directory/title.txt', 'r') as f:
    text = f.read()
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
print(words)

Please note that the code uses a special kind of string literal for the regular expression: a raw string, marked by the r prefix. In a raw string Python does not process backslash escape sequences such as \n or \b, so the backslashes reach the re module unchanged, which is very useful for regular expressions.
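
A short illustration of what the r prefix changes. The word-boundary pattern and the variable names here are chosen for the example and do not come from the question.

```python
import re

# In an ordinary string literal, "\b" is the backspace character (one character).
# In a raw string literal, r"\b" is a backslash followed by "b" (two characters),
# which the re module then reads as a word boundary.
print(len("\b"), len(r"\b"))  # 1 2

title = "!Alarma!_(album)"
raw_matches = re.findall(r"\b[a-zA-Z]+\b", title)   # word-boundary pattern
plain_matches = re.findall("\b[a-zA-Z]+\b", title)  # searches for backspace characters

print(raw_matches)    # ['Alarma', 'album']
print(plain_matches)  # []
```

For the pattern [^a-zA-Z] itself the r prefix makes no difference, because the pattern contains no backslash; it is simply a good habit for regular expressions.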

+2

Source: https://habr.com/ru/post/1656521/