I am working on a dataset in which each variable's source is indicated by a two-letter prefix on its name. Thus, all variables from source AA begin with AA_ (e.g. AA_var1), and all variables from source bb begin with bb_ (e.g. bb_variable_name_2). There are actually many sources and many variable names, but I keep only two sources here as a minimal example.
I want to create a row-mean variable for every row where the number of sources, that is, the number of unique prefixes with at least one non-NA value in that row, is greater than 1. If a row has data from only one source (or none), I want the new variable to be NA.
So, for example, my data looks like this:
> head(df)
  AA_var1 AA_var2   myid bb_meow bb_A_v1
1      NA      NA 123456      10      12
2      NA      10 194200      12      NA
3      12      10 132200      NA      NA
4      12      NA 132201      NA      12
5      NA      NA 132202      NA      NA
6      12      13 132203      14      NA
And I want the following:
> head(df)
  AA_var1 AA_var2   myid bb_meow bb_A_v1 rowMeanIfDiverseData
1      NA      NA 123456      10      12                   NA  # has only bb
2      NA      10 194200      12      NA                   11  # has AA and bb
3      12      10 132200      NA      NA                   NA  # has only AA
4      12      NA 132201      NA      12                   12  # has AA and bb
5      NA      NA 132202      NA      NA                   NA  # has neither
6      12      13 132203      14      NA                   13  # has AA and bb
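I know that rowMeans() computes the mean of each row, but I cannot work out how to apply it only to rows with data from more than one source.
First, I extract the unique prefixes from the column names: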
library(stringr)
mynames <- setdiff(names(df), "myid")               # all columns except the id
tmp <- str_extract(mynames, "[A-Za-z]{2}(?=_)")     # two letters followed by "_"
uniq <- unique(tmp[!is.na(tmp)])
This gives:
> uniq
[1] "AA" "bb"
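For comparison, the same prefixes can probably be obtained in base R without stringr. This is only a sketch, assuming that the source prefix is always everything before the first underscore; the name `prefix` is illustrative and not part of my code above.

```r
# Column names from the example, id column already dropped
mynames <- c("AA_var1", "AA_var2", "bb_meow", "bb_A_v1")

# Remove the first underscore and everything after it
prefix <- sub("_.*$", "", mynames)

unique(prefix)
# [1] "AA" "bb"
```

Keeping the full `prefix` vector (one entry per column) rather than only the unique values makes it possible to look up which columns belong to which source later on.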
Then, to test whether a row of df has more than one source, I wrote this function:
multiSource <- function(x, badnames = c("myid")) {
  nm <- names(x[!names(x) %in% badnames])       # names, excluding id columns
  tmp <- str_extract(nm, "[A-Za-z]{2}(?=_)")
  uniq <- unique(tmp[!is.na(tmp)]) # ensure unique and not NA
  length(uniq) > 1                              # TRUE if more than one prefix
}
But it does not do what I want:
> lapply(df,multiSource)
$AA_var1
[1] FALSE
$AA_var2
[1] FALSE
$bb_meow
[1] FALSE
$bb_A_v1
[1] FALSE
...
> apply(df,MARGIN=1,FUN=multiSource)
returns TRUE for every row.
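To make the intended row-wise test concrete, here is a minimal sketch of a check that looks at which values are non-NA rather than only at the names. The function name `count_sources` and its `id_cols` argument are hypothetical, and it assumes each row arrives as a named vector (as it does with `apply(df, 1, ...)`) and that the prefix is everything before the first underscore.

```r
# Count the distinct source prefixes that have at least one non-NA value
count_sources <- function(row, id_cols = "myid") {
  keep <- !is.na(row) & !(names(row) %in% id_cols)
  length(unique(sub("_.*$", "", names(row)[keep])))
}

# Rows 1 and 2 of the example data, written as named vectors
row1 <- c(AA_var1 = NA, AA_var2 = NA, myid = 123456, bb_meow = 10, bb_A_v1 = 12)
row2 <- c(AA_var1 = NA, AA_var2 = 10, myid = 194200, bb_meow = 12, bb_A_v1 = NA)

count_sources(row1)  # 1 (only bb)
count_sources(row2)  # 2 (AA and bb)
```

My own function never inspects the values, so under `apply` every row sees both prefixes, while under `lapply` each column vector has no names at all.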
What I am aiming for is something like this:
df$rowMean <- rowMeans(df[setdiff(names(df), "myid")], na.rm = TRUE)  # leave the id column out of the mean
# so, in this case
rowMeansIfTest <- function(X, test) {
  # is this row multiSource TRUE?
  # if yes, return(rowMeans(X))
  # else return(NA)
}
df$rowMeanIfDiverseData <- rowMeansIfTest(df, test=multiSource)
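To pin down the result I am after, here is a minimal base R sketch that reproduces the desired column on the example data. It is one possible approach rather than a definitive solution. It assumes the prefix is everything before the first underscore and that `myid` is the only non-data column; `value_cols`, `prefix`, `has_source` and `n_sources` are illustrative names.

```r
# Example data from above
df <- data.frame(
  AA_var1 = c(NA, NA, 12, 12, NA, 12),
  AA_var2 = c(NA, 10, 10, NA, NA, 13),
  myid    = c(123456, 194200, 132200, 132201, 132202, 132203),
  bb_meow = c(10, 12, NA, NA, NA, 14),
  bb_A_v1 = c(12, NA, NA, 12, NA, NA)
)

value_cols <- setdiff(names(df), "myid")   # data columns only
prefix <- sub("_.*$", "", value_cols)      # source prefix of each data column

# One logical column per source: does the row have any non-NA value from it?
has_source <- sapply(unique(prefix), function(p) {
  rowSums(!is.na(df[, value_cols[prefix == p], drop = FALSE])) > 0
})
n_sources <- rowSums(has_source)

# Row mean over all data columns, kept only where more than one source is present
df$rowMeanIfDiverseData <- ifelse(
  n_sources > 1,
  rowMeans(df[, value_cols], na.rm = TRUE),
  NA_real_
)

df$rowMeanIfDiverseData
# [1] NA 11 NA 12 NA 13
```

Working on whole columns with `rowSums()` and `rowMeans()` avoids calling a function once per row, which should matter when there are many rows and many sources.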