Effective use of functions on long data frames in R

I have a long data frame containing meteorological data from a mast. It contains observations (`data$value`) taken simultaneously for different parameters (wind speed, direction, air temperature, etc., stored in `data$param`) at different heights (`data$z`).

I am trying to efficiently slice this data by `time` and then apply functions to each slice. Usually functions are applied to one `param` at a time (i.e. I apply different functions to wind speed than to air temperature).

Current approach

My current method is to subset the data frame and then use `plyr::ddply()`.

If I want to get all the wind speed data, I run this:

```r
# find good data ----
df <- data[(data$param == "wind speed") & !is.na(data$value), ]
```

Then I run my function on df using ddply() :

```r
df.tav <- ddply(df, .(time), function(x) {
  y <- data.frame(V1 = sum(x$value) + sum(x$z),
                  V2 = sum(x$value) / sum(x$z))
  return(y)
})
```

Typically, `V1` and `V2` are calls to other functions; these are just examples. The important point is that I need to run several functions on the same slice of data.
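To make that concrete, here is a minimal, self-contained sketch of what I mean by "several functions on the same data": the toy `df` and the placeholder functions `f1`/`f2` are stand-ins I made up for this illustration, not my real analysis code.

```r
library(plyr)

# toy stand-in for the wind-speed subset built above
df <- data.frame(time  = c(1, 1, 2, 2),
                 z     = c(40, 80, 40, 80),
                 value = c(4.0, 5.0, 3.0, 6.0))

# placeholder analysis functions; the real ones are domain-specific
f1 <- function(x) sum(x$value) + sum(x$z)
f2 <- function(x) sum(x$value) / sum(x$z)

# one ddply() pass evaluates both functions per time slice
df.tav <- ddply(df, .(time), function(x)
  data.frame(V1 = f1(x), V2 = f2(x)))
df.tav
#   time  V1    V2
# 1    1 129 0.075
# 2    2 129 0.075
```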

Question

My current approach is very slow. I haven't benchmarked it, but it is slow enough that I can go get a cup of coffee and return before a year of data is processed.

I have on the order of hundreds of towers to process, each with a year of data and 10-12 heights, so I am looking for something faster.

Sample data

```r
data <- structure(list(
  time = structure(c(
    1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
    1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
    1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
    1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600, 1262304600,
    1262304600, 1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 1262305200, 1262305200),
    class = c("POSIXct", "POSIXt"), tzone = ""),
  z = c(0, 0, 0, 100, 100, 100, 120, 120, 120, 140, 140, 140, 160, 160, 160,
        180, 180, 180, 200, 200, 200, 40, 40, 40, 50, 50, 50, 60, 60, 60,
        80, 80, 80, 0, 0, 0, 100, 100, 100, 120),
  param = c("temperature", "humidity", "barometric pressure",
            "wind direction", "turbulence", "wind speed",
            "wind direction", "turbulence", "wind speed",
            "wind direction", "turbulence", "wind speed",
            "wind direction", "turbulence", "wind speed",
            "wind direction", "turbulence", "wind speed",
            "wind direction", "turbulence", "wind speed",
            "wind direction", "turbulence", "wind speed",
            "wind direction", "turbulence", "wind speed",
            "wind direction", "turbulence", "wind speed",
            "wind direction", "turbulence", "wind speed",
            "temperature", "barometric pressure", "humidity",
            "wind direction", "wind speed", "turbulence", "wind direction"),
  value = c(-2.5, 41, 816.9, 248.4, 0.11, 4.63, 249.8, 0.28, 4.37, 255.5,
            0.32, 4.35, 252.4, 0.77, 5.08, 248.4, 0.65, 3.88, 313, 0.94, 6.35,
            250.9, 0.1, 4.75, 253.3, 0.11, 4.68, 255.8, 0.1, 4.78, 254.9, 0.11,
            4.7, -3.3, 816.9, 42, 253.2, 2.18, 0.27, 229.5)),
  .Names = c("time", "z", "param", "value"),
  row.names = c(NA, 40L), class = "data.frame")
```
2 answers

Use `data.table`:

```r
library(data.table)
dt = data.table(data)
setkey(dt, param)  # sort by param to look it up fast
dt[J('wind speed')][!is.na(value),
   list(sum(value) + sum(z), sum(value) / sum(z)),
   by = time]
#                  time      V1         V2
#1: 2009-12-31 18:10:00 1177.57 0.04209735
#2: 2009-12-31 18:20:00  102.18 0.02180000
```

If you want to apply a different function for each parameter, here is a more unified approach.

```r
# make dt smaller because I'm lazy
dt = dt[param %in% c('wind direction', 'wind speed')]

# now let's start - create another data.table
# that maps each param to its corresponding function
fns = data.table(p = c('wind direction', 'wind speed'),
                 fn = c(quote(sum(value) + sum(z)),
                        quote(sum(value) / sum(z))),
                 key = 'p')
fns
#                p     fn
#1: wind direction <call>   # the fn column contains unevaluated calls
#2:     wind speed <call>   # i.e. this is getting fancy!

# now we can evaluate different functions for different params,
# sliced by param and time
dt[!is.na(value),
   {param; eval(fns[J(param)]$fn[[1]], .SD)},
   by = list(param, time)]
#             param                time           V1
#1: wind direction 2009-12-31 18:10:00 3.712400e+03
#2: wind direction 2009-12-31 18:20:00 7.027000e+02
#3:     wind speed 2009-12-31 18:10:00 4.209735e-02
#4:     wind speed 2009-12-31 18:20:00 2.180000e-02
```

P.S. I think the fact that I have to reference `param` in some way before the `eval` for the `eval` to work is a bug.


UPDATE: As of version 1.8.11 this bug has been fixed and the following works:

```r
dt[!is.na(value), eval(fns[J(param)]$fn[[1]], .SD), by = list(param, time)]
```

Use `dplyr`. It is still under development, but it is much faster than `plyr`:

```r
# devtools::install_github("hadley/dplyr")
library(dplyr)

windspeed <- subset(data, param == "wind speed")
daily <- group_by(windspeed, time)
summarise(daily, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))
```

Another advantage of dplyr is that you can use a data.table as a backend without knowing anything about data.table's special syntax:

```r
library(data.table)
daily_dt <- group_by(data.table(windspeed), time)
summarise(daily_dt, V1 = sum(value) + sum(z), V2 = sum(value) / sum(z))
```

(dplyr with a data frame is 20-100x faster than plyr, and dplyr with a data.table is about 10x faster still.) dplyr is nowhere near as concise as data.table, but it has a function for every important data-analysis task, which I think makes code easier to understand: you can almost read a sequence of dplyr operations aloud to someone else and they will understand what is happening.
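Rather than taking those ratios on faith, you can time the two approaches on your own data. Below is a minimal sketch using the `microbenchmark` package; the toy `windspeed` data frame is an assumption made for this example, and the exact timings will depend on your data size and machine.

```r
library(microbenchmark)

# toy stand-in for the wind-speed subset; made up for this sketch
set.seed(1)
windspeed <- data.frame(time  = rep(1:100, each = 5),
                        z     = rep(c(40, 50, 60, 80, 100), 100),
                        value = runif(500, 2, 8))

bm <- microbenchmark(
  plyr  = plyr::ddply(windspeed, plyr::.(time), function(x)
            data.frame(V1 = sum(x$value) + sum(x$z))),
  dplyr = dplyr::summarise(dplyr::group_by(windspeed, time),
                           V1 = sum(value) + sum(z)),
  times = 10
)
print(bm)  # median times per expression; exact numbers vary by machine
```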

If you want to compute different summaries for different variables, I recommend reshaping your data into "tidy" form:

```r
library(reshape2)

data_tidy <- dcast(data, ... ~ param)
daily_tidy <- group_by(data_tidy, time)
summarise(daily_tidy,
  mean.pressure = mean(`barometric pressure`, na.rm = TRUE),
  sd.turbulence = sd(turbulence, na.rm = TRUE)
)
```

Source: https://habr.com/ru/post/954791/