Removing only contiguous duplicates in a data frame in R

I have a data frame in R that is expected to contain duplicates. However, some of those duplicates need to be removed. In particular, I only want to remove adjacent duplicates and keep the rest. For example, suppose I had a data frame:

df = data.frame(x = c("A", "B", "C", "A", "B", "C", "A", "B", "B", "C"), y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)) 

The result is the following data frame

 x  y
 A  1
 B  2
 C  3
 A  4
 B  5
 C  6
 A  7
 B  8
 B  9
 C 10

In this case, I expect the pattern "A, B, C, A, B, C, etc." to repeat. It is only a problem when duplicates appear on adjacent rows. In my example above, these are rows 8 and 9, where the two "B" values are next to each other.

In my dataset, every time this happens, the first instance is always a user error and the second is always the correct version. In very rare cases, a value may be duplicated 3 (or more) times in a row. In every case, I would like to keep only the last occurrence. So, following the example above, I would like the final dataset to look like

 x  y
 A  1
 B  2
 C  3
 A  4
 B  5
 C  6
 A  7
 B  9
 C 10

Is there an easy way to do this in R? Thank you in advance for your help!


Edit: 11/19/2014 12:14 PM EST There was a solution submitted by Akron (spelling?), which has since been deleted. I'm not sure why, because it worked for me.

The solution was

 df = df[with(df, c(x[-1] != x[-nrow(df)], TRUE)), ]

It seems to work for me, so why was it deleted? For example, it handles cases with more than two consecutive duplicates:

 > df = data.frame(x = c("A", "B", "B", "B", "C", "C", "C", "A", "B", "C", "A", "B", "B", "C"), y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
 > df
    x  y
 1  A  1
 2  B  2
 3  B  3
 4  B  4
 5  C  5
 6  C  6
 7  C  7
 8  A  8
 9  B  9
 10 C 10
 11 A 11
 12 B 12
 13 B 13
 14 C 14
 > df = df[with(df, c(x[-1] != x[-nrow(df)], TRUE)), ]
 > df
    x  y
 1  A  1
 4  B  4
 7  C  7
 8  A  8
 9  B  9
 10 C 10
 11 A 11
 13 B 13
 14 C 14

Does this seem to work?

3 answers

Try

  df[with(df, c(x[-1] != x[-nrow(df)], TRUE)), ]
  #    x  y
  # 1  A  1
  # 2  B  2
  # 3  C  3
  # 4  A  4
  # 5  B  5
  # 6  C  6
  # 7  A  7
  # 9  B  9
  # 10 C 10

Explanation

Here we compare each element with the element that follows it. This is done by comparing the column with its first element removed against the same column with its last element removed (so that the two vectors have equal length).

  df$x[-1] # first element removed
  # [1] B C A B C A B B C
  df$x[-nrow(df)] # last element `C` removed
  # [1] A B C A B C A B B
  df$x[-1] != df$x[-nrow(df)]
  # [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

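In the above example, the resulting logical vector is one element shorter than nrow(df), because one element was removed from each side of the comparison. To compensate, we append TRUE (the last row has no successor, so it is always kept) and then use this logical index to subset the data frame.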
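To make the steps above easier to reuse, the comparison can be wrapped in a small function. This is only a sketch: the name `keep_last_of_run` and its arguments are hypothetical (they do not appear in the answer), and it assumes the column contains no NA values, since NA would propagate into the logical index.

```r
# Hypothetical helper: keep only the last row of each run of adjacent
# duplicates in column `col`. Assumes the column has no NA values.
keep_last_of_run <- function(data, col) {
  v <- as.character(data[[col]])   # works for both factor and character columns
  n <- length(v)
  if (n < 2) return(data)          # nothing to compare against
  # TRUE where a row differs from the row after it; the last row is always kept
  keep <- c(v[-1] != v[-n], TRUE)
  data[keep, , drop = FALSE]
}

# First example from the question
df <- data.frame(x = c("A", "B", "C", "A", "B", "C", "A", "B", "B", "C"),
                 y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
res <- keep_last_of_run(df, "x")

# Second example, with runs longer than two
df2 <- data.frame(x = c("A", "B", "B", "B", "C", "C", "C", "A", "B", "C",
                        "A", "B", "B", "C"),
                  y = 1:14)
res2 <- keep_last_of_run(df2, "x")

print(res)
print(res2)
```

On the two examples from the question this keeps rows 1 to 7, 9, 10 and rows 1, 4, 7, 8, 9, 10, 11, 13, 14 respectively, matching the output shown above.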


Here is the rle solution:

 df[cumsum(rle(as.character(df$x))$lengths), ]
 #    x  y
 # 1  A  1
 # 2  B  2
 # 3  C  3
 # 4  A  4
 # 5  B  5
 # 6  C  6
 # 7  A  7
 # 9  B  9
 # 10 C 10
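The answer gives no explanation, so here is a sketch of what the pieces appear to do, using the `x` column from the question. `rle()` splits the vector into runs of equal adjacent values, and the cumulative sum of the run lengths gives the position of the last element of each run. The variable names `runs` and `last_rows` are only illustrative.

```r
x <- c("A", "B", "C", "A", "B", "C", "A", "B", "B", "C")

runs <- rle(x)
runs$values    # the value of each run, in order
runs$lengths   # how many times each value repeats consecutively

# The running total of run lengths is the index of the last row of each run
last_rows <- cumsum(runs$lengths)
print(last_rows)
```

For this vector the only run longer than one is the pair of "B" values at positions 8 and 9, so `last_rows` skips 8 and contains 1 to 7, 9 and 10.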

You can also try

 df[c(diff(as.numeric(df$x)), 1) != 0, ] 

If x is a character vector rather than a factor (the default for data.frame() since R 4.0.0, where stringsAsFactors = FALSE), convert it to a factor first:

 df[c(diff(as.numeric(factor(df$x))), 1) != 0, ]
 #    x  y
 # 1  A  1
 # 2  B  2
 # 3  C  3
 # 4  A  4
 # 5  B  5
 # 6  C  6
 # 7  A  7
 # 9  B  9
 # 10 C 10
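As a quick sanity check, the three approaches from the answers can be run side by side on the longer example from the question (the one with runs of three). This is a sketch; the names `by_compare`, `by_rle` and `by_diff` are illustrative only.

```r
df <- data.frame(x = c("A", "B", "B", "B", "C", "C", "C", "A", "B", "C",
                       "A", "B", "B", "C"),
                 y = 1:14)

# 1. Compare each element with the next one
by_compare <- df[c(df$x[-1] != df$x[-nrow(df)], TRUE), ]

# 2. Run-length encoding: index of the last row of each run
by_rle <- df[cumsum(rle(as.character(df$x))$lengths), ]

# 3. Non-zero difference between consecutive factor codes
by_diff <- df[c(diff(as.numeric(factor(df$x))), 1) != 0, ]

# All three should select the same rows
stopifnot(identical(rownames(by_compare), rownames(by_rle)),
          identical(rownames(by_compare), rownames(by_diff)))
print(by_compare)
```

All three select rows 1, 4, 7, 8, 9, 10, 11, 13 and 14, that is, the last row of every run.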

Source: https://habr.com/ru/post/1207237/
