Help with R and grouping / aggregate / *apply / data.table

I am very new to R and am having trouble running functions to get the answers I need. Here is my sample data, PCSTest:

http://pastebin.com/z9Ti3nHB

It looks something like this:

    Date        Site            Word
    --------------------------------------
    9/1/2012    slashdot        javascript
    9/1/2012    stackexchange   R
    9/1/2012    reddit          R
    9/1/2012    slashdot        javascript
    9/1/2012    stackexchange   javascript
    9/5/2012    reddit          R
    9/8/2012    slashdot        javascript
    9/8/2012    stackexchange   R
    9/8/2012    reddit          R
    9/8/2012    slashdot        javascript
    9/18/2012   stackexchange   R
    9/18/2012   reddit          R
    9/18/2012   slashdot        javascript
    9/18/2012   stackexchange   R
    9/27/2012   reddit          R
    9/27/2012   slashdot        R

My goal is to look for trends in occurrences of different words, as they relate to sites over time. I can count them:

    library(plyr)
    PCSTest <- read.csv(file="c:/PCS/PCS Data - Test.csv", header=TRUE)
    PCSTest$Date <- as.Date(PCSTest$Date, "%m/%d/%Y")
    PCSTest$Date <- as.POSIXct(PCSTest$Date)
    countTest <- count(PCSTest, c("Date", "Site", "Word"))

which gives the following:

                      Date          Site       Word freq
    1  2012-08-31 20:00:00        reddit          R    4
    2  2012-08-31 20:00:00      slashdot javascript    7
    3  2012-08-31 20:00:00 stackexchange javascript    1
    4  2012-08-31 20:00:00 stackexchange          R    2
    5  2012-09-01 20:00:00        reddit javascript    2
    6  2012-09-01 20:00:00      slashdot          R    3
    7  2012-09-04 20:00:00        reddit          R    1
    8  2012-09-07 20:00:00        reddit          R    1
    9  2012-09-07 20:00:00      slashdot javascript    2
    10 2012-09-07 20:00:00 stackexchange          R    1
    11 2012-09-09 20:00:00 stackexchange javascript    4
    12 2012-09-10 20:00:00      slashdot          R    4
    13 2012-09-14 20:00:00        reddit javascript    4
    14 2012-09-17 20:00:00        reddit          R    4
    15 2012-09-17 20:00:00      slashdot javascript    1
    16 2012-09-17 20:00:00 stackexchange          R    2
    17 2012-09-19 20:00:00        reddit javascript    2
    18 2012-09-23 20:00:00 stackexchange javascript    2
    19 2012-09-24 20:00:00        reddit javascript    3
    20 2012-09-24 20:00:00 stackexchange javascript    1
    21 2012-09-24 20:00:00 stackexchange          R    4
    22 2012-09-25 20:00:00        reddit javascript    5
    23 2012-09-25 20:00:00      slashdot javascript    3
    24 2012-09-25 20:00:00 stackexchange          R    7
    25 2012-09-26 20:00:00        reddit          R    1
    26 2012-09-26 20:00:00      slashdot          R    5

or plot them:

    library(ggplot2)
    ggplot(data=countTest,
           aes(x=Date, y=freq,
               group=interaction(Site, Word),
               colour=interaction(Site, Word),
               shape=Site)) +
      geom_line() +
      geom_point()

My plot of Frequency per day for Words per Site

I now need to do some calculations on the data, so I tried aggregate:

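    # agg has to exist before it can be used for ordering
    agg <- aggregate(freq ~ Site + Word, data = countTest,
                     function(freq) cbind(mean(freq), max(freq)))
    # the summary has only two columns (mean, max); sorting by max is an
    # assumption, because the original indexed a third column that does not exist
    agg[order(-agg$freq[, 2]), ]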

which gives:

               Site       Word freq.1 freq.2
    2      slashdot javascript   3.25   7.00
    5      slashdot          R   4.00   5.00
    1        reddit javascript   3.20   5.00
    4        reddit          R   2.20   4.00
    6 stackexchange          R   3.20   7.00
    3 stackexchange javascript   2.00   4.00

What I would like in this last result is a column with the average frequency per day, for example sum(freq) / 20 days, calculated from the data, and perhaps a moving average as well. I also need another column with the slope of a linear regression. How would I calculate these in an aggregate function?

Or, how can I do it better / faster? I know that there are the *apply functions and data.table, but I don't see how to use them. Any help would be greatly appreciated!

1 answer

I'm not entirely sure what you want to do, but dplyr (or plyr) will help you. Here are some examples. If you state more precisely what you want, you will get more suggestions.

    d <- read.csv("~/Downloads/r_data.txt")
    d$Date <- as.POSIXct(as.Date(d$Date, "%m/%d/%Y"))

    library(dplyr)
    d.cnt <- d %>%
      group_by(Date, Site, Word) %>%
      summarise(cnt = n())

    # average per day
    date.range <- d$Date %>% range %>% diff %>% as.numeric  # gives 26 days
    # or
    date.range <- d$Date %>% unique %>% length              # gives 13 days
    d.ave <- d.cnt %>%
      group_by(Site, Word) %>%
      summarize(ave_per_day = sum(cnt)/date.range)

    # slope
    d.reg <- d.cnt %>%
      group_by(Site, Word) %>%
      do({
        fit = lm(cnt ~ Date, data = .)
        data.frame(int = coef(fit)[1], slope = coef(fit)[2])
      })

    # plot the slope value
    library(ggplot2)
    ggplot(d.reg, aes(Site, slope, fill = Word)) +
      geom_bar(stat = "identity", position = "dodge")
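
One thing to watch with the regression: `Date` is stored as POSIXct, which counts seconds, so the slope from `lm(cnt ~ Date)` should come out per second, not per day. Below is a minimal base R sketch of the same per-group summary (average per day plus slope) with `Date` kept as a plain `Date`, so the slope reads as change in count per day. The `counts` data frame is invented toy data in the same shape as the counts table, and `summarise_group()` is a hypothetical helper name, not part of any package.

```r
# Toy counts in the same shape as the counts table (values are invented).
counts <- data.frame(
  Date = as.Date(c("2012-09-01", "2012-09-08", "2012-09-18",
                   "2012-09-01", "2012-09-05", "2012-09-27")),
  Site = c("slashdot", "slashdot", "slashdot", "reddit", "reddit", "reddit"),
  Word = c("javascript", "javascript", "javascript", "R", "R", "R"),
  freq = c(7, 2, 1, 4, 1, 1)
)

# Span between the first and last observed date, in days.
n_days <- as.numeric(diff(range(counts$Date)))

# Hypothetical helper: one summary row per Site/Word group.
summarise_group <- function(g) {
  # as.numeric(Date) is days since 1970-01-01, so the slope is per day.
  fit <- lm(freq ~ as.numeric(Date), data = g)
  data.frame(
    Site = g$Site[1],
    Word = g$Word[1],
    ave_per_day = sum(g$freq) / n_days,
    slope_per_day = unname(coef(fit)[2])
  )
}

# Split into groups, summarise each one, and stack the rows again.
groups <- split(counts, list(counts$Site, counts$Word), drop = TRUE)
result <- do.call(rbind, lapply(groups, summarise_group))
rownames(result) <- NULL
print(result)
```

The split / lapply / rbind pattern is the base R counterpart of `group_by() %>% do()`. Whether to divide by the date span or by the number of distinct dates is the same choice as between the two `date.range` definitions above.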

Source: https://habr.com/ru/post/1202106/

