You can first read_csv
with the name
parameter to create a DataFrame
with the Region Name
column, the separator is a value that is NOT in the values ββ(for example DataFrame
):
df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
Then insert
new State
column with extract
, where the text is [edit]
and replace
all the values ββfrom (
to the end to the Region Name
column.
df.insert(0, 'State', df['Region Name'].str.extract('(.*)\[edit\]', expand=False).ffill()) df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '')
The last to delete the lines where the text is [edit]
boolean indexing
, the mask is created by str.contains
:
df = df[~df['Region Name'].str.contains('\[edit\]')].reset_index(drop=True) print (df) State Region Name 0 Alabama Auburn 1 Alabama Florence 2 Alabama Jacksonville 3 Alabama Livingston 4 Alabama Montevallo 5 Alabama Troy 6 Alabama Tuscaloosa 7 Alabama Tuskegee 8 Alaska Fairbanks 9 Arizona Flagstaff 10 Arizona Tempe 11 Arizona Tucson
If a solution to all values ββis required, it is simpler:
df = pd.read_csv('filename.txt', sep=";", names=['Region Name']) df.insert(0, 'State', df['Region Name'].str.extract('(.*)\[edit\]', expand=False).ffill()) df = df[~df['Region Name'].str.contains('\[edit\]')].reset_index(drop=True) print (df) State Region Name 0 Alabama Auburn (Auburn University)[1] 1 Alabama Florence (University of North Alabama) 2 Alabama Jacksonville (Jacksonville State University)[2] 3 Alabama Livingston (University of West Alabama)[2] 4 Alabama Montevallo (University of Montevallo)[2] 5 Alabama Troy (Troy University)[2] 6 Alabama Tuscaloosa (University of Alabama, Stillman Co... 7 Alabama Tuskegee (Tuskegee University)[5] 8 Alaska Fairbanks (University of Alaska Fairbanks)[2] 9 Arizona Flagstaff (Northern Arizona University)[6] 10 Arizona Tempe (Arizona State University) 11 Arizona Tucson (University of Arizona)
source share