I have a DataFrame with columns :x and :y. I would like to calculate how often each combination of :x and :y occurs in the DataFrame, and what percentage of the occurrences of each :y value that combination accounts for. I already have the first part, thanks to a previous question.
using DataFrames
mydf = DataFrame(y = rand('a':'h', 1000), x = rand('i':'p', 1000))
mydfsum = by(mydf, [:x, :y], df -> DataFrame(n = length(df[:x])))
This successfully creates a column that counts how often each value of :x occurs with each value of :y. Now I need a new column that counts how often each value of :y occurs overall. I could create a new DataFrame using:
mydfsumy = by(mydf, [:y], df -> DataFrame(ny = length(df[:x])))
Then join the two DataFrames together:
mydfsum = join(mydfsum, mydfsumy, on = :y)
And finally create the percentage column :yp:
mydfsum[:yp] = mydfsum[:n] ./ mydfsum[:ny]
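To make the target quantity concrete, here is a minimal sketch in plain Julia, without DataFrames, of what n, ny and yp are meant to contain. The toy vectors xs and ys are hypothetical and only stand in for the :x and :y columns; this illustrates the arithmetic, not a replacement for the DataFrame code above.

```julia
# Hypothetical toy data standing in for the :x and :y columns.
xs = ['i', 'i', 'j', 'j', 'j', 'k']
ys = ['a', 'a', 'a', 'b', 'b', 'b']

# n: how often each (x, y) combination occurs.
n = Dict{Tuple{Char,Char},Int}()
# ny: how often each y value occurs overall.
ny = Dict{Char,Int}()
for (x, y) in zip(xs, ys)
    n[(x, y)] = get(n, (x, y), 0) + 1
    ny[y] = get(ny, y, 0) + 1
end

# yp: the share of each y's occurrences that falls on a given x,
# i.e. n divided by ny for the matching y.
yp = Dict(k => c / ny[k[2]] for (k, c) in n)
```

Within each value of y, the yp values should sum to 1, which is a quick check that the join and division steps produce what I expect.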
Is there a simpler way to do this? In R, using dplyr, I would write:
mydf %>% group_by(x, y) %>% summarize(n = n()) %>% group_by(y) %>% mutate(yp = n / sum(n))