Using Pandas, you can do the following:
import pandas as pd df = pd.read_table('data', sep='\n', header=None, names=['town']) df['is_state'] = df['town'].str.contains(r'\[edit\]') df['groupno'] = df['is_state'].cumsum() df['index'] = df.groupby('groupno').cumcount() df['state'] = df.groupby('groupno')['town'].transform('first') df['state'] = df['state'].str.replace(r'\[edit\]', '') df['town'] = df['town'].str.replace(r' \(.+$', '') df = df.loc[~df['is_state']] df = df[['state','town']]
what gives
state town 1 Alabama Auburn 2 Alabama Florence 3 Alabama Jacksonville 5 Alaska Fairbanks 7 Arizona Flagstaff 8 Arizona Tempe 9 Arizona Tucson
Here is a breakdown of what the code does. After loading the text file into the DataFrame, use str.contains to identify the rows that are states. Use cumsum to get the total sum of True / False values, where True is treated as 1 and False as 0.
df = pd.read_table('data', sep='\n', header=None, names=['town']) df['is_state'] = df['town'].str.contains(r'\[edit\]') df['groupno'] = df['is_state'].cumsum() # town is_state groupno # 0 Alabama[edit] True 1 # 1 Auburn (Auburn University)[1] False 1 # 2 Florence (University of North Alabama) False 1 # 3 Jacksonville (Jacksonville State University)[2] False 1 # 4 Alaska[edit] True 2 # 5 Fairbanks (University of Alaska Fairbanks)[2] False 2 # 6 Arizona[edit] True 3 # 7 Flagstaff (Northern Arizona University)[6] False 3 # 8 Tempe (Arizona State University) False 3 # 9 Tucson (University of Arizona) False 3
Now for each groupno number we can assign a unique integer to each row in the group:
df['index'] = df.groupby('groupno').cumcount()
Again for each groupno number groupno we can find the state by selecting the first city in each group:
df['state'] = df.groupby('groupno')['town'].transform('first')
Basically we have the desired DataFrame; all that remains is the result prefix. We can remove [edit] from state and everything after the first bracket from town with str.replace :
df['state'] = df['state'].str.replace(r'\[edit\]', '') df['town'] = df['town'].str.replace(r' \(.+$', '')
Remove the lines in which town is actually a state:
df = df.loc[~df['is_state']]
And finally, save only the columns you need:
df = df[['state','town']]