Pandas Group of Highest Values Dataframe

Question

I have a pandas DataFrame with two columns (snippet below). I am trying to use the City column to infer Borough (you will notice some Unspecified values that need to be replaced). To do this, I want to find the most frequent borough for each city and store the result in a dictionary where the key is a city and the value is the most frequent borough for that city.

 City       Borough
 Brooklyn   Brooklyn
 Astoria    Queens
 Astoria    Unspecified
 Ridgewood  Unspecified
 Ridgewood  Queens

So, if Ridgewood is paired with Queens 100 times, Brooklyn 4 times and Manhattan 1 time, the pair should be Ridgewood: Queens.

So far I have tried this code:

 specified = data[['Borough', 'City']][data['Borough'] != 'Unspecified']
 paired = specified.Borough.groupby(specified.City).max()

At first glance the output looked right, but on closer inspection it was incorrect. Any ideas?

EDIT:

I tried the following statement:

 paired = specified.groupby("City").agg(lambda x: stats.mode(x['Borough'])[0])

I noticed that some of the boroughs came out truncated, as shown below:

 paired.Borough.value_counts()
 #[Out]# QUEENS           58
 #[Out]# MANHATTAN         7
 #[Out]# STATEN ISLAND     4
 #[Out]# BRONX             4
 #[Out]# BROOKLYN          3
 #[Out]# MANHATTA          2
 #[Out]# STATE             1
 #[Out]# QUEEN             1
 #[Out]# MANHA             1
 #[Out]# BROOK             1

Of course, I can simply replace the truncated words manually, but I am curious what causes the truncation.

PS - Here is the DataFrame output, FYI:

 specified
 #[Out]# <class 'pandas.core.frame.DataFrame'>
 #[Out]# Int64Index: 719644 entries, 1 to 396225
 #[Out]# Data columns:
 #[Out]# Borough    719644  non-null values
 #[Out]# City       651617  non-null values
 #[Out]# dtypes: object(2)

 specified.Borough.value_counts()
 #[Out]# QUEENS           215382
 #[Out]# BROOKLYN         208565
 #[Out]# MANHATTAN        150016
 #[Out]# BRONX             94648
 #[Out]# STATEN ISLAND     51033

python pandas

Chrisarmrmrong Nov 19 '12 at 2:12

1 answer

Brenbarn · Answer 1 · 2012-11-19T02:25:26+0000

I believe this will do:

 from scipy import stats
 d.groupby('City').agg(lambda x: stats.mode(x['Borough'])[0])

This gives you a DataFrame indexed by City, with the most common borough for each city in the Borough column:

 >>> d
          City      Borough
 0    Brooklyn     Brooklyn
 1     Astoria       Queens
 2     Astoria       Queens
 3     Astoria     Brooklyn
 4     Astoria  Unspecified
 5   Ridgewood  Unspecified
 6   Ridgewood       Queens
 7   Ridgewood       Queens
 8   Ridgewood     Brooklyn
 9   Ridgewood     Brooklyn
 10  Ridgewood     Brooklyn
 >>> d.groupby('City').agg(lambda x: stats.mode(x['Borough'])[0])
             Borough
 City
 Astoria      Queens
 Brooklyn   Brooklyn
 Ridgewood  Brooklyn
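For comparison, here is a minimal pandas-only sketch of the same idea. It is not part of the original answer: it swaps stats.mode for value_counts().idxmax(), and newer pandas and SciPy releases may not behave like the 2012 versions used above. The sample frame is the one from the example, and the names mapping and filled are only illustrative. It also shows why the .max() attempt from the question goes wrong: on strings, max() returns the alphabetically last value, not the most frequent one.

```python
import pandas as pd

# Sample data copied from the example above.
d = pd.DataFrame({
    "City": ["Brooklyn", "Astoria", "Astoria", "Astoria", "Astoria",
             "Ridgewood", "Ridgewood", "Ridgewood", "Ridgewood",
             "Ridgewood", "Ridgewood"],
    "Borough": ["Brooklyn", "Queens", "Queens", "Brooklyn", "Unspecified",
                "Unspecified", "Queens", "Queens", "Brooklyn",
                "Brooklyn", "Brooklyn"],
})

# Ignore the placeholder rows when counting.
specified = d[d["Borough"] != "Unspecified"]

# max() on strings is lexicographic, so it picks "Queens" for Ridgewood
# even though "Brooklyn" occurs more often there.
lexicographic = specified.groupby("City")["Borough"].max()

# value_counts() counts each borough within a city;
# idxmax() returns the label with the largest count.
paired = specified.groupby("City")["Borough"].agg(
    lambda s: s.value_counts().idxmax()
)

# City -> most frequent borough, as a plain dictionary.
mapping = paired.to_dict()

# Replace "Unspecified" with the most frequent borough for that city.
filled = d["Borough"].where(d["Borough"] != "Unspecified",
                            d["City"].map(mapping))
```

Here mapping is the dictionary the question asks for, and filled is the Borough column with the Unspecified entries replaced from it.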

(If you don't have SciPy installed, you will need to write your own "mode" function, which I think you could do with collections.Counter. But if you are using pandas, it is a good bet that you have SciPy installed too.)
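A hand-rolled mode along those lines might look like the sketch below. It uses only the standard library and works on plain (city, borough) pairs rather than a DataFrame. The names rows, counts and most_common_borough are made up for illustration, and the data is the sample from the example above.

```python
from collections import Counter, defaultdict

# (city, borough) pairs, copied from the sample frame above.
rows = [
    ("Brooklyn", "Brooklyn"),
    ("Astoria", "Queens"),
    ("Astoria", "Queens"),
    ("Astoria", "Brooklyn"),
    ("Astoria", "Unspecified"),
    ("Ridgewood", "Unspecified"),
    ("Ridgewood", "Queens"),
    ("Ridgewood", "Queens"),
    ("Ridgewood", "Brooklyn"),
    ("Ridgewood", "Brooklyn"),
    ("Ridgewood", "Brooklyn"),
]

# One Counter per city, skipping the placeholder value.
counts = defaultdict(Counter)
for city, borough in rows:
    if borough != "Unspecified":
        counts[city][borough] += 1

# most_common(1) returns a list holding the single (value, count) pair
# with the highest count; keep only the value.
most_common_borough = {
    city: counter.most_common(1)[0][0]
    for city, counter in counts.items()
}
```

With a DataFrame, the pairs could come from something like zip(d['City'], d['Borough']). Note that when two boroughs tie, most_common keeps whichever was seen first, so the result for tied cities depends on row order.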
