R - filter data from a data frame

I am new to R and don't know how to filter data in a data frame.

I created a data frame with two columns: the monthly date and the corresponding temperature. It has 324 rows.

    > head(Nino3.4_1974_2000)
      Month_common Nino3.4_degree_1974_2000_plain
    1   1974-01-15                       -1.93025
    2   1974-02-15                       -1.73535
    3   1974-03-15                       -1.20040
    4   1974-04-15                       -1.00390
    5   1974-05-15                       -0.62550
    6   1974-06-15                       -0.36915

The filter rule is to select temperatures greater than or equal to 0.5 degrees. In addition, the condition must hold for at least 5 consecutive months.

I removed the rows with a temperature below 0.5 degrees (see below).

    for (i in 1) {
      el_nino=Nino3.4_1974_2000[which(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain >= 0.5),]
    }
    > head(el_nino)
       Month_common Nino3.4_degree_1974_2000_plain
    32   1976-08-15                      0.5192000
    33   1976-09-15                      0.8740000
    34   1976-10-15                      0.8864501
    35   1976-11-15                      0.8229501
    36   1976-12-15                      0.7336500
    37   1977-01-15                      0.9276500

However, I still need to extract only the runs of at least 5 consecutive months. Could someone help me?

2 answers

If you can always rely on the observations being exactly one month apart, you can set the date information aside for now:

 temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain 

Since consecutive temperatures in this vector are always one month apart, we just need to look for runs where temps[i] >= 0.5, and each run must have a length of at least 5.

If we do the following:

 ofinterest <- temps >= 0.5 

we will have a vector ofinterest with values TRUE FALSE FALSE TRUE TRUE ... etc., where it is TRUE when temps[i] was >= 0.5 and FALSE otherwise.

To rephrase your problem, we just need to look for occurrences of at least five TRUEs in a row.
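As a quick illustration of that comparison, here is a minimal sketch with made-up values (temps_demo and ofinterest_demo are hypothetical names, not the Nino3.4 data):

```r
# Made-up temperatures, for illustration only
temps_demo <- c(0.7, 0.2, 0.1, 0.5, 0.9)

# The element-wise comparison returns one logical value per month
ofinterest_demo <- temps_demo >= 0.5
print(ofinterest_demo)
# [1]  TRUE FALSE FALSE  TRUE  TRUE
```

Note that a value of exactly 0.5 counts as TRUE because the comparison is >=.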

For this we can use the rle function. ?rle gives:

    > ?rle

    Description
      Compute the lengths and values of runs of equal values in a vector - or the reverse operation.

    Value:
      'rle()' returns an object of class '"rle"' which is a list with components:

      lengths: an integer vector containing the length of each run.
      values: a vector of the same length as 'lengths' with the corresponding values.

So we use rle, which counts the lengths of all runs of consecutive TRUEs and consecutive FALSEs, and then look for runs of at least 5 TRUEs.

I'll just make up some data to demonstrate:
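To make the output of rle concrete, here is a small sketch on a hand-written logical vector (flags_demo and runs_demo are made-up names for illustration):

```r
# Hand-written flags: a run of 2 TRUEs, 1 FALSE, 5 TRUEs, 1 FALSE
flags_demo <- c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)

runs_demo <- rle(flags_demo)
print(runs_demo$lengths)
# [1] 2 1 5 1
print(runs_demo$values)
# [1]  TRUE FALSE  TRUE FALSE

# Position (within the list of runs) of every TRUE run of length >= 5
print(which(runs_demo$lengths >= 5 & runs_demo$values))
# [1] 3
```

The result 3 is an index into the runs, not into the original vector, which is why a conversion step is still needed afterwards.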

    # for you, temps <- Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain
    temps <- runif(1000)

    # make a vector that is TRUE when temperature is >= 0.5 and FALSE otherwise
    ofinterest <- temps >= 0.5

    # count up the runs of TRUEs and FALSEs using rle:
    runs <- rle(ofinterest)

    # we need to find points where runs$lengths >= 5 (i.e. at least 5 in a row),
    # AND runs$values is TRUE (so at least 5 TRUEs in a row).
    streakIs <- which(runs$lengths >= 5 & runs$values)

    # these are all the el_nino occurrences.
    # We need to convert `streakIs` into indices into our original `temps` vector.
    # To do this we add up all the `runs$lengths` before `streakIs[i]` and add 1;
    # that gives the index into `temps` where the run starts.
    # that is:
    # startMonths <- c()
    # for ( n in streakIs ) {
    #   startMonths <- c(startMonths, sum(runs$lengths[seq_len(n - 1)]) + 1)
    # }
    #
    # However, since this is R we can avoid the explicit loop with vapply.
    # seq_len(n - 1) is used instead of 1:(n-1) because 1:0 is c(1, 0), which
    # would give a wrong start index when the very first run qualifies.
    startMonths <- vapply(streakIs, function(n) sum(runs$lengths[seq_len(n - 1)]) + 1, numeric(1))

Now, if you evaluate Nino3.4_1974_2000$Month_common[startMonths], you will get all the months in which an El Nino started.
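If you want every row that belongs to a qualifying run, not only the starting months, one possible approach is to expand the per-run decision back to one flag per row with rep. This is a sketch on a made-up data frame (df_demo and its values are hypothetical stand-ins for Nino3.4_1974_2000):

```r
# Hypothetical stand-in for the real data: 12 monthly values, made up
df_demo <- data.frame(
  Month = seq(as.Date("2000-01-15"), by = "month", length.out = 12),
  Temp  = c(0.6, 0.7, 0.1, 0.5, 0.8, 0.9, 0.6, 0.7, 0.2, 0.9, 0.9, 0.1)
)

runs_demo <- rle(df_demo$Temp >= 0.5)

# One TRUE/FALSE per run: is it a run of at least 5 warm months?
keep_run <- runs_demo$values & runs_demo$lengths >= 5

# Repeat each run's decision once per row it covers
keep_row <- rep(keep_run, runs_demo$lengths)

el_nino_rows <- df_demo[keep_row, ]
print(el_nino_rows)
```

With these made-up values only rows 4 to 8 are kept: the two-month warm spells at the start and near the end are too short to qualify.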

It comes down to a few lines:

    runs <- rle(Nino3.4_1974_2000$Nino3.4_degree_1974_2000_plain >= 0.5)
    streakIs <- which(runs$lengths >= 5 & runs$values)
    # seq_len(n - 1) keeps the start index correct even when the first run qualifies
    startMonths <- vapply(streakIs, function(n) sum(runs$lengths[seq_len(n - 1)]) + 1, numeric(1))
    Nino3.4_1974_2000$Month_common[startMonths]

Here is one way that takes advantage of the fact that the observations are always exactly one month apart. The problem then comes down to finding at least 5 consecutive rows with temps >= 0.5 degrees:

    # Some sample data
    d <- data.frame(Month=1:20, Temp=c(rep(1,6),0,rep(1,4),0,rep(1,5),0, rep(1,2)))
    d

    # Use rle to find runs of temps >= 0.5 degrees
    x <- rle(d$Temp >= 0.5)

    # Then find the last row in each run of 5 or more
    y <- x$lengths>=5 # BUG HERE: See update below!
    lastRow <- cumsum(x$lengths)[y]

    # Finally, deduce the first row and make a result matrix
    firstRow <- lastRow - x$lengths[y] + 1L
    res <- cbind(firstRow, lastRow)

    res
    #     firstRow lastRow
    #[1,]        1       6
    #[2,]       13      17

UPDATE: I had a bug that also detected runs of 5 or more values less than 0.5. Here's the updated code (and test data):

    d <- data.frame(Month=1:20, Temp=c(rep(0,6),1,0,rep(1,4),0,rep(1,5),0, 1))

    x <- rle(d$Temp >= 0.5)
    # keep only runs that are long enough AND consist of TRUEs
    y <- x$lengths>=5 & x$values
    lastRow <- cumsum(x$lengths)[y]
    firstRow <- lastRow - x$lengths[y] + 1L
    res <- cbind(firstRow, lastRow)

    res
    #     firstRow lastRow
    #[1,]       14      18
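If you need this in more than one place, the corrected logic could be wrapped in a small helper. The function below is only a sketch: find_runs and its arguments threshold and min_len are hypothetical names introduced here, and it assumes the input has no missing values and is regularly spaced.

```r
# Hypothetical helper: locate runs of at least `min_len` consecutive
# values that are >= `threshold`. Assumes `x` contains no NAs.
find_runs <- function(x, threshold = 0.5, min_len = 5L) {
  r <- rle(x >= threshold)
  ok <- r$values & r$lengths >= min_len   # long enough AND a TRUE run
  last <- cumsum(r$lengths)[ok]           # last index of each kept run
  first <- last - r$lengths[ok] + 1L      # first index of each kept run
  data.frame(first = first, last = last, length = r$lengths[ok])
}

# Same test data as in the updated answer above
temp_demo <- c(rep(0, 6), 1, 0, rep(1, 4), 0, rep(1, 5), 0, 1)
res_demo <- find_runs(temp_demo)
print(res_demo)
```

On this test data the helper reports a single run covering rows 14 to 18, which matches the result matrix above. When no run qualifies it returns a data frame with zero rows.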

Source: https://habr.com/ru/post/1391582/

