Classify data by value in pandas

I have a pandas.DataFrame of the form:

    low_bound  high_bound  name
            0          10   'a'
           10          20   'b'
           20          30   'c'
           30          40   'd'
           40          50   'e'

I also have a very long pandas.Series of the form:

    value
      5.7
     30.4
       21
     35.1

I want to assign each value in the Series its corresponding name, based on the low_bound / high_bound / name DataFrame. Here is my expected result:

    value  name
      5.7   'a'
     30.4   'd'
       21   'c'
     35.1   'd'

Indeed, the name for 5.7 is 'a', since 5.7 lies between 0 and 10 (bounds excluded).

What would be the most efficient code? I know I can solve the problem by iterating over the Series, but maybe there is a faster vectorized solution that eludes me.

Please note that my ranges may be regular or irregular. Here they are regular for the sake of the example.

1 answer

pandas has a function called cut that will do what you want:

    import pandas as pd

    data = [
        {"low": 0, "high": 10, "name": "a"},
        {"low": 10, "high": 20, "name": "b"},
        {"low": 20, "high": 30, "name": "c"},
        {"low": 30, "high": 40, "name": "d"},
        {"low": 40, "high": 50, "name": "e"},
    ]
    myDF = pd.DataFrame(data)

    # data to be binned
    mySeries = pd.Series([5.7, 30.4, 21, 35.1])

    # create bin edges from the original data:
    # the first lower bound followed by every upper bound
    bins = list(myDF["high"])
    bins.insert(0, myDF["low"].iloc[0])

    print(pd.cut(mySeries, bins, labels=list(myDF["name"])))

This will give you the following, which you can then put back into a DataFrame or store however you like:

    0    a
    1    d
    2    c
    3    d
    dtype: category
    Categories (5, object): [a < b < c < d < e]
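To get the value / name layout from the question, the labels returned by cut can be placed next to the original values. This is only a minimal sketch: the names bounds, values, edges and result are illustrative and not part of the original answer.

```python
import pandas as pd

# Same bounds table as in the question (column names shortened as in the answer).
bounds = pd.DataFrame({
    "low": [0, 10, 20, 30, 40],
    "high": [10, 20, 30, 40, 50],
    "name": ["a", "b", "c", "d", "e"],
})
values = pd.Series([5.7, 30.4, 21, 35.1], name="value")

# Bin edges: the first lower bound, then every upper bound.
# This assumes the bins are contiguous and sorted.
edges = [bounds["low"].iloc[0]] + bounds["high"].tolist()

# Put the values and their labels side by side.
result = pd.DataFrame({
    "value": values,
    "name": pd.cut(values, bins=edges, labels=bounds["name"].tolist()),
})
print(result)
```

The name column is categorical; calling result["name"].astype(str) turns it into plain strings if that is more convenient.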

Depending on how irregular your bins are (and what exactly you mean by regular / irregular), you may have to resort to a loop. I can't think of a built-in off the top of my head that will handle this for you, especially since it depends on the degree / type of irregularity in the bins.
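One option that may avoid an explicit loop when the bins have unequal widths or gaps is pandas.IntervalIndex. The sketch below assumes the intervals do not overlap and uses made-up irregular bounds; the names bounds, values, intervals, pos and labels are illustrative. The bounds are cast to float so they have the same type as the values being looked up.

```python
import numpy as np
import pandas as pd

# Hypothetical irregular bins: unequal widths and a gap between 20 and 25.
bounds = pd.DataFrame({
    "low": [0, 10, 25, 30],
    "high": [10, 20, 30, 50],
    "name": ["a", "b", "c", "d"],
})
values = pd.Series([5.7, 30.4, 21, 35.1])

# One interval per row; closed="right" means low < value <= high.
intervals = pd.IntervalIndex.from_arrays(
    bounds["low"].astype(float), bounds["high"].astype(float), closed="right"
)

# Position of the interval containing each value, or -1 if none contains it.
pos = intervals.get_indexer(values)

# Translate positions into names; unmatched values get a missing label.
names = bounds["name"].to_numpy(dtype=object)
labels = pd.Series(np.where(pos >= 0, names[pos], None), index=values.index, name="name")
print(labels)
```

Here 21 falls into the gap between the second and third bins, so its label is missing rather than borrowed from a neighbouring bin.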

A loop-based approach: this will work as long as you have a lower and an upper bound, regardless of "regularity":

    for el in mySeries:
        # select the name whose bounds strictly contain the value
        print(myDF["name"][(myDF["low"] < el) & (myDF["high"] > el)])

I appreciate that you may not want to loop through a huge Series, but at least we are not manually indexing into the DataFrame, which would probably slow things down even more.


Source: https://habr.com/ru/post/1246452/

