read_table in pandas: how to load text into a DataFrame

Question

    Alabama[edit]
    Auburn (Auburn University)[1]
    Florence (University of North Alabama)
    Jacksonville (Jacksonville State University)[2]
    Alaska[edit]
    Fairbanks (University of Alaska Fairbanks)[2]
    Arizona[edit]
    Flagstaff (Northern Arizona University)[6]
    Tempe (Arizona State University)
    Tucson (University of Arizona)

This is my text. I need to create a DataFrame with one column for the name of the state and another for the name of the city. I know how to remove the university names, but how can I tell pandas that each [edit] marks the start of a new state?

Expected output:

    Alabama Auburn
    Alabama Florence
    Alabama Jacksonville
    Alaska Fairbanks
    Arizona Flagstaff
    Arizona Tempe
    Arizona Tucson

I'm not sure whether I can use read_table; if I can, how should I do it? I imported everything into a DataFrame, but the state and city end up in the same column. I also tried with a list, but the problem is the same.

I need something that recognizes a line containing [edit] as a state and applies that state to every line after it, up to the next [edit] line.

python python-3.x pandas sklearn-pandas

lucarlig Nov 04 '16 at 0:17

2 answers

Using Pandas, you can do the following:

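    import pandas as pd

    # read_table splits on tabs by default, so each line (assumed to be
    # tab-free) lands in a single cell. Recent pandas versions reject sep='\n'.
    df = pd.read_table('data', header=None, names=['town'])
    df['is_state'] = df['town'].str.contains(r'\[edit\]')
    df['groupno'] = df['is_state'].cumsum()
    df['index'] = df.groupby('groupno').cumcount()
    df['state'] = df.groupby('groupno')['town'].transform('first')
    # regex=True is required on pandas 2.0+, where str.replace is literal by default
    df['state'] = df['state'].str.replace(r'\[edit\]', '', regex=True)
    df['town'] = df['town'].str.replace(r' \(.+$', '', regex=True)
    df = df.loc[~df['is_state']]
    df = df[['state', 'town']]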

which gives

         state          town
    1  Alabama        Auburn
    2  Alabama      Florence
    3  Alabama  Jacksonville
    5   Alaska     Fairbanks
    7  Arizona     Flagstaff
    8  Arizona         Tempe
    9  Arizona        Tucson

Here is a breakdown of what the code does. After loading the text file into the DataFrame, use str.contains to identify the rows that are states. Then use cumsum to take the running (cumulative) sum of the True/False values, where True counts as 1 and False as 0, so that every row receives the number of the state block it belongs to.

    # Default tab separator: each tab-free line becomes one row
    # (recent pandas versions reject sep='\n').
    df = pd.read_table('data', header=None, names=['town'])
    df['is_state'] = df['town'].str.contains(r'\[edit\]')
    df['groupno'] = df['is_state'].cumsum()
    #    town                                             is_state  groupno
    # 0  Alabama[edit]                                    True      1
    # 1  Auburn (Auburn University)[1]                    False     1
    # 2  Florence (University of North Alabama)           False     1
    # 3  Jacksonville (Jacksonville State University)[2]  False     1
    # 4  Alaska[edit]                                     True      2
    # 5  Fairbanks (University of Alaska Fairbanks)[2]    False     2
    # 6  Arizona[edit]                                    True      3
    # 7  Flagstaff (Northern Arizona University)[6]       False     3
    # 8  Tempe (Arizona State University)                 False     3
    # 9  Tucson (University of Arizona)                   False     3

Now, within each groupno, we can assign a unique integer to each row of the group:

    df['index'] = df.groupby('groupno').cumcount()
    #    town                                             is_state  groupno  index
    # 0  Alabama[edit]                                    True      1        0
    # 1  Auburn (Auburn University)[1]                    False     1        1
    # 2  Florence (University of North Alabama)           False     1        2
    # 3  Jacksonville (Jacksonville State University)[2]  False     1        3
    # 4  Alaska[edit]                                     True      2        0
    # 5  Fairbanks (University of Alaska Fairbanks)[2]    False     2        1
    # 6  Arizona[edit]                                    True      3        0
    # 7  Flagstaff (Northern Arizona University)[6]       False     3        1
    # 8  Tempe (Arizona State University)                 False     3        2
    # 9  Tucson (University of Arizona)                   False     3        3

Again, for each groupno we can find the state by selecting the first row in each group, which is always the state line:

    df['state'] = df.groupby('groupno')['town'].transform('first')
    #    town                                             is_state  groupno  index  state
    # 0  Alabama[edit]                                    True      1        0      Alabama[edit]
    # 1  Auburn (Auburn University)[1]                    False     1        1      Alabama[edit]
    # 2  Florence (University of North Alabama)           False     1        2      Alabama[edit]
    # 3  Jacksonville (Jacksonville State University)[2]  False     1        3      Alabama[edit]
    # 4  Alaska[edit]                                     True      2        0      Alaska[edit]
    # 5  Fairbanks (University of Alaska Fairbanks)[2]    False     2        1      Alaska[edit]
    # 6  Arizona[edit]                                    True      3        0      Arizona[edit]
    # 7  Flagstaff (Northern Arizona University)[6]       False     3        1      Arizona[edit]
    # 8  Tempe (Arizona State University)                 False     3        2      Arizona[edit]
    # 9  Tucson (University of Arizona)                   False     3        3      Arizona[edit]

We now have essentially the desired DataFrame; all that remains is to tidy up the values. We can remove [edit] from state, and everything from the first opening parenthesis onward from town, with str.replace:

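    # regex=True is required on pandas 2.0+, where str.replace is literal by default
    df['state'] = df['state'].str.replace(r'\[edit\]', '', regex=True)
    df['town'] = df['town'].str.replace(r' \(.+$', '', regex=True)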
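The two patterns can be checked in isolation on a couple of small Series. This is only a sketch; the names `states`, `towns`, `clean_states` and `clean_towns` are made up for the illustration and are not part of the answer's code.

```python
import pandas as pd

# Hypothetical sample values taken from the question's text.
states = pd.Series(['Alabama[edit]', 'Alaska[edit]'])
towns = pd.Series(['Auburn (Auburn University)[1]',
                   'Tempe (Arizona State University)'])

# \[edit\] matches the literal "[edit]" marker.
clean_states = states.str.replace(r'\[edit\]', '', regex=True)

# " \(.+$" matches from the space before the first "(" to the end of the string.
clean_towns = towns.str.replace(r' \(.+$', '', regex=True)

print(clean_states.tolist())
print(clean_towns.tolist())
```

Passing `regex=True` explicitly keeps the behavior the same across pandas versions, since the default for this parameter changed in pandas 2.0.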

Remove the lines in which town is actually a state:

 df = df.loc[~df['is_state']]

And finally, save only the columns you need:

 df = df[['state','town']]
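For anyone who wants to try the steps above without creating a file first, here is a minimal self-contained sketch. It assumes the text is already available as a string (the variable `text` is introduced only for this example) and builds the one-column frame from `splitlines()` instead of `read_table`; the remaining steps follow the walkthrough.

```python
import pandas as pd

# Sample text from the question, one entry per line (assumption: no blank lines).
text = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)"""

# One row per line, all in a single 'town' column.
df = pd.DataFrame({'town': text.splitlines()})

# Flag state rows and number the blocks with a cumulative sum.
df['is_state'] = df['town'].str.contains(r'\[edit\]')
df['groupno'] = df['is_state'].cumsum()

# The first row of every block is the state line; broadcast it to the block.
df['state'] = df.groupby('groupno')['town'].transform('first')

# Strip the "[edit]" marker and the parenthesised university names.
df['state'] = df['state'].str.replace(r'\[edit\]', '', regex=True)
df['town'] = df['town'].str.replace(r' \(.+$', '', regex=True)

# Drop the state rows and keep only the two columns of interest.
df = df.loc[~df['is_state'], ['state', 'town']].reset_index(drop=True)
print(df)
```

The `reset_index(drop=True)` call is an optional extra; without it the result keeps the original row labels (1, 2, 3, 5, ...), as in the output shown earlier.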

unutbu Nov 04 '16 at 1:13

furas · Accepted Answer · 2016-11-04T00:44:24+0000

Maybe pandas can do it, but you can also do it easily in plain Python.

    data = '''Alabama[edit]
    Auburn (Auburn University)[1]
    Florence (University of North Alabama)
    Jacksonville (Jacksonville State University)[2]
    Alaska[edit]
    Fairbanks (University of Alaska Fairbanks)[2]
    Arizona[edit]
    Flagstaff (Northern Arizona University)[6]
    Tempe (Arizona State University)
    Tucson (University of Arizona)'''

    # ---

    result = []
    state = None

    for line in data.split('\n'):
        if line.endswith('[edit]'):
            # remember new state
            state = line[:-6]  # without `[edit]`
        else:
            # add state, city to result
            city, rest = line.split(' ', 1)
            result.append([state, city])

    # --- display ---

    for state, city in result:
        print(state, city)

If you are reading from a file instead:

    result = []
    state = None

    with open('your_file') as f:
        for line in f:
            line = line.strip()  # remove '\n'
            if not line:
                continue  # skip blank lines, which would break the split below
            if line.endswith('[edit]'):
                # remember new state
                state = line[:-6]  # without `[edit]`
            else:
                # add state, city to result
                city, rest = line.split(' ', 1)
                result.append([state, city])

    # --- display ---

    for state, city in result:
        print(state, city)

Now you can use result to create a DataFrame.
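A minimal sketch of that last step, assuming `result` is the list of `[state, city]` pairs built by the loop above. A short literal stands in for it here so the snippet runs on its own, and the column names `state` and `city` are chosen for the example.

```python
import pandas as pd

# Stand-in for the list produced by the loop: one [state, city] pair per row.
result = [
    ['Alabama', 'Auburn'],
    ['Alabama', 'Florence'],
    ['Alaska', 'Fairbanks'],
]

# Each inner list becomes a row; `columns` names the two fields.
df = pd.DataFrame(result, columns=['state', 'city'])
print(df)
```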
