I have two data.frames. The first is a lookup table that tells me the set of products in each group. Each group has at least one product of type 1 and one of type 2.
The second data.frame tells me the details of the transaction. Each transaction can have one of the following products:
a) Only type 1 products from one of the groups
b) Only type 2 products from one of the groups
c) Type 1 and type 2 product from the same group
For my analysis, I am interested in case c) above, that is, how many transactions have products of type 1 and type 2 from the same group. A transaction is ignored completely if the type 1 and type 2 products sold in it come from different groups.
Thus, each product of type 1 or type 2 MUST belong to the same group.
Here is my lookup table:
> P_Lookup
Group ProductID1 ProductID2
Group1 A 1
Group1 B 2
Group1 B 3
Group2 C 4
Group2 C 5
Group2 C 6
Group3 D 7
Group3 C 8
Group3 C 9
Group4 E 10
Group4 F 11
Group4 G 12
Group5 H 13
Group5 H 14
Group5 H 15
For example, I will not have product G and product 15 in one transaction, because they belong to different groups.
Here are the transactions:
TransactionID ProductID ProductType
a1 A 1
a1 B 1
a1 1 2
a2 C 1
a2 4 2
a2 5 2
a3 D 1
a3 C 1
a3 7 2
a3 8 2
a4 H 1
a5 1 2
a5 2 2
a5 3 2
a5 3 2
a5 1 2
a6 H 1
a6 15 2
I was able to write code, using dplyr, that shortlists the transactions for one group. However, I'm not sure how I can vectorize my code for all groups.
Here is my code:
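library(dplyr)
P_Groups <- unique(P_Lookup$Group)
Chosen_Group <- P_Groups[5]  # the group is picked by hand
P_Group_Ind <- P_Trans %>%
  group_by(TransactionID) %>%
  # keep only products that belong to the chosen group
  dplyr::filter((ProductID %in% unique(P_Lookup[P_Lookup$Group == Chosen_Group, ]$ProductID1)) |
                (ProductID %in% unique(P_Lookup[P_Lookup$Group == Chosen_Group, ]$ProductID2))) %>%
  # count the distinct product types per transaction
  mutate(No_of_PIDs = n_distinct(ProductType)) %>%
  mutate(Group_Name = Chosen_Group)
# keep transactions that contain both product types
P_Group_Ind <- P_Group_Ind[P_Group_Ind$No_of_PIDs > 1, ]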
This works well, as long as I manually select each group, i.e. by setting Chosen_Group. However, I am not sure how I can automate this. One way, I think, is to use a for loop, but I know that the beauty of R is vectorization, so I want to stay away from using a for loop.
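One direction that might avoid the loop is to reshape the lookup table so that every product, of either type, has its own row with its group, and then join it to the transactions. The sketch below is only an assumption about how that could look, not a confirmed solution: the names Lookup_Long, P_All_Groups and Matches are made up for illustration, the data is retyped from the dput() output in the DATA section, and base merge() is used for the join because product C belongs to two groups.

```r
library(dplyr)

# Data retyped from the dput() output in the DATA section.
P_Lookup <- data.frame(
  Group = rep(paste0("Group", 1:5), each = 3),
  ProductID1 = c("A", "B", "B", "C", "C", "C", "D", "C", "C",
                 "E", "F", "G", "H", "H", "H"),
  ProductID2 = 1:15,
  stringsAsFactors = FALSE
)
P_Trans <- data.frame(
  TransactionID = c("a1", "a1", "a1", "a2", "a2", "a2", "a3", "a3", "a3",
                    "a3", "a4", "a5", "a5", "a5", "a5", "a5", "a6", "a6"),
  ProductID = c("A", "B", "1", "C", "4", "5", "D", "C", "7", "8", "H",
                "1", "2", "3", "3", "1", "H", "15"),
  ProductType = c(1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Hypothetical helper table: one row per (Group, ProductID) pair,
# with both product columns stacked into a single character column.
Lookup_Long <- bind_rows(
  P_Lookup %>% transmute(Group, ProductID = ProductID1),
  P_Lookup %>% transmute(Group, ProductID = as.character(ProductID2))
) %>%
  distinct()

# Attach every group a product belongs to, then keep only the
# (TransactionID, Group) combinations that contain both product types.
P_All_Groups <- merge(P_Trans, Lookup_Long, by = "ProductID") %>%
  group_by(TransactionID, Group) %>%
  filter(all(c(1, 2) %in% ProductType)) %>%
  ungroup()

# One row per qualifying transaction and group.
Matches <- P_All_Groups %>%
  distinct(TransactionID, Group) %>%
  arrange(TransactionID)
```

Working through the sample data by hand, Matches should contain a1/Group1, a2/Group2, a3/Group3 and a6/Group5. Products that are missing from the lookup table are dropped by the join, so they would not count towards any group.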
DATA:
dput() of P_Trans:
structure(list(TransactionID = c("a1", "a1", "a1", "a2", "a2",
"a2", "a3", "a3", "a3", "a3", "a4", "a5", "a5", "a5", "a5", "a5",
"a6", "a6"), ProductID = c("A", "B", "1", "C", "4", "5", "D",
"C", "7", "8", "H", "1", "2", "3", "3", "1", "H", "15"), ProductType = c(1,
1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2)), .Names = c("TransactionID",
"ProductID", "ProductType"), row.names = c(NA, 18L), class = "data.frame")
dput() of P_Lookup:
structure(list(Group = c("Group1", "Group1", "Group1", "Group2",
"Group2", "Group2", "Group3", "Group3", "Group3", "Group4", "Group4",
"Group4", "Group5", "Group5", "Group5"), ProductID1 = c("A",
"B", "B", "C", "C", "C", "D", "C", "C", "E", "F", "G", "H", "H",
"H"), ProductID2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15)), .Names = c("Group", "ProductID1", "ProductID2"), row.names = c(NA,
15L), class = "data.frame")
dput() of P_Trans, with an additional transaction (a7) whose product is not in the lookup table:
structure(list(TransactionID = c("a1", "a1", "a1", "a2", "a2",
"a2", "a3", "a3", "a3", "a3", "a4", "a5", "a5", "a5", "a5", "a5",
"a6", "a6", "a7"), ProductID = c("A", "B", "1", "C", "4", "5",
"D", "C", "7", "8", "H", "1", "2", "3", "3", "1", "H", "15",
"22"), ProductType = c(1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2,
2, 2, 2, 1, 2, 3)), .Names = c("TransactionID", "ProductID",
"ProductType"), row.names = c(NA, 19L), class = "data.frame")