Using Pandas, you can do the following:
import pandas as pd df = pd.read_table('data', sep='\n', header=None, names=['town']) df['is_state'] = df['town'].str.contains(r'\[edit\]') df['groupno'] = df['is_state'].cumsum() df['index'] = df.groupby('groupno').cumcount() df['state'] = df.groupby('groupno')['town'].transform('first') df['state'] = df['state'].str.replace(r'\[edit\]', '') df['town'] = df['town'].str.replace(r' \(.+$', '') df = df.loc[~df['is_state']] df = df[['state','town']]
what gives
state town 1 Alabama Auburn 2 Alabama Florence 3 Alabama Jacksonville 5 Alaska Fairbanks 7 Arizona Flagstaff 8 Arizona Tempe 9 Arizona Tucson
Here is a breakdown of what the code does. After loading the text file into the DataFrame, use str.contains
to identify the rows that are states. Use cumsum
to get the total sum of True / False values, where True is treated as 1 and False as 0.
df = pd.read_table('data', sep='\n', header=None, names=['town']) df['is_state'] = df['town'].str.contains(r'\[edit\]') df['groupno'] = df['is_state'].cumsum() # town is_state groupno # 0 Alabama[edit] True 1 # 1 Auburn (Auburn University)[1] False 1 # 2 Florence (University of North Alabama) False 1 # 3 Jacksonville (Jacksonville State University)[2] False 1 # 4 Alaska[edit] True 2 # 5 Fairbanks (University of Alaska Fairbanks)[2] False 2 # 6 Arizona[edit] True 3 # 7 Flagstaff (Northern Arizona University)[6] False 3 # 8 Tempe (Arizona State University) False 3 # 9 Tucson (University of Arizona) False 3
Now for each groupno
number we can assign a unique integer to each row in the group:
df['index'] = df.groupby('groupno').cumcount()
Again for each groupno
number groupno
we can find the state by selecting the first city in each group:
df['state'] = df.groupby('groupno')['town'].transform('first')
Basically we have the desired DataFrame; all that remains is the result prefix. We can remove [edit]
from state
and everything after the first bracket from town
with str.replace
:
df['state'] = df['state'].str.replace(r'\[edit\]', '') df['town'] = df['town'].str.replace(r' \(.+$', '')
Remove the lines in which town
is actually a state:
df = df.loc[~df['is_state']]
And finally, save only the columns you need:
df = df[['state','town']]