Convert row to column data by a specific row name in R

Hey, so I'm pretty new to R and only know some of its features. I have raw data of about 2,000,000 rows.

The raw data looks like the sample below. Each product has up to four types of tariff (AHS, BND, MFN, PRF); some products have PRF and some do not. The goal is to turn each item's tariffs into columns, one column per tariff type.

 AHS 3.00
 BND 3.80
 MFN 4.00
 PRF 2.00
 AHS 4.00
 BND 3.80
 MFN 4.00

How can I convert the raw data into the following?

  AHS  BND  MFN  PRF
 3.00 3.80 4.00 2.00
 4.00 3.80 4.00   NA

I tried rbind, but for items that do not have PRF, R puts the AHS value in the PRF column.

Can someone tell me how to do this conversion? Many thanks!

2 answers

Create a grp variable that is 1 for the first group, 2 for the second, and so on. Then use tapply:

 grp <- cumsum(DF$V1 == "AHS")
 tapply(DF$V2, list(grp, DF$V1), sum)

giving:

   AHS BND MFN PRF
 1   3 3.8   4   2
 2   4 3.8   4  NA

We used this as data:

 DF <- data.frame(
   V1 = c("AHS", "BND", "MFN", "PRF", "AHS", "BND", "MFN"),
   V2 = c(3, 3.8, 4, 2, 4, 3.8, 4),
   stringsAsFactors = FALSE)
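
To make the grouping step in this answer easier to follow, here is a minimal sketch that prints grp and converts the tapply result to a data.frame. It assumes every item starts with an "AHS" row, which holds for the sample data but is only an assumption about the full 2M-row data set. The names wide and wide_df are illustrative.

```r
# Sample data from the answer above
DF <- data.frame(
  V1 = c("AHS", "BND", "MFN", "PRF", "AHS", "BND", "MFN"),
  V2 = c(3, 3.8, 4, 2, 4, 3.8, 4),
  stringsAsFactors = FALSE)

# Each "AHS" row is assumed to start a new item, so the running count
# of "AHS" rows labels the item that every row belongs to
grp <- cumsum(DF$V1 == "AHS")
grp
# [1] 1 1 1 1 2 2 2

# tapply returns a matrix; item/tariff combinations with no row are NA
wide <- tapply(DF$V2, list(grp, DF$V1), sum)

# Convert to a data.frame if one is needed downstream
wide_df <- as.data.frame(wide)
```

Because each item/tariff cell holds a single value here, sum simply passes that value through.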

You can use ave in base R, or a comparable approach from a package, to create an id variable. Since some "PRF" values are missing, you probably also need to use cummax when creating the id.
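
As a rough illustration of why cummax matters, the sketch below builds the id on hypothetical data in which the first item has no PRF row (DF2 and its values are made up for this example). It counts over seq_along(V1) rather than over V1 itself so that the intermediate result is an integer vector; that is a small variation, not the exact call used in the alternatives.

```r
# Hypothetical data: the FIRST item has no PRF row (values are made up)
DF2 <- data.frame(
  V1 = c("AHS", "BND", "MFN", "AHS", "BND", "MFN", "PRF"),
  V2 = c(3, 3.8, 4, 4, 3.8, 4, 2),
  stringsAsFactors = FALSE)

# Running count within each tariff type
per_type <- ave(seq_along(DF2$V1), DF2$V1, FUN = seq_along)
per_type
# [1] 1 1 1 2 2 2 1   <- the only PRF row is counted as PRF number 1

# cummax carries the largest id seen so far forward, which moves
# that PRF row into the second item, where it belongs
id <- cummax(per_type)
id
# [1] 1 1 1 2 2 2 2
```

Without cummax, the PRF value would be attached to item 1 instead of item 2.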

Here are a few alternatives, all of which use @G.Grothendieck's sample data. My vote goes to the "data.table" approach.

 DF <- data.frame(
   V1 = c("AHS", "BND", "MFN", "PRF", "AHS", "BND", "MFN"),
   V2 = c(3, 3.8, 4, 2, 4, 3.8, 4),
   stringsAsFactors = FALSE)

Base R: reshape

Notorious for its syntax, and probably not recommended for working with 2M rows.

 reshape(within(DF, {
   id <- cummax(ave(V1, V1, FUN = seq_along))
 }), direction = "wide", idvar = "id", timevar = "V1")

Base R: xtabs

The syntax is easier to remember, but it is less flexible. It also returns a matrix, so you will need as.data.frame.matrix if you want a data.frame. It fills in missing values with 0, which may be undesirable.

 xtabs(V2 ~ id + V1, within(DF, {
   id <- cummax(ave(V1, V1, FUN = seq_along))
 }))
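
The conversion to a data.frame mentioned above might look like the following sketch. It builds the id with seq_along(V1) so that it is an integer column, which is a small variation on the call above, and the names tab and wide are illustrative.

```r
DF <- data.frame(
  V1 = c("AHS", "BND", "MFN", "PRF", "AHS", "BND", "MFN"),
  V2 = c(3, 3.8, 4, 2, 4, 3.8, 4),
  stringsAsFactors = FALSE)

# Same id logic as above, computed on integer positions
DF$id <- cummax(ave(seq_along(DF$V1), DF$V1, FUN = seq_along))

# xtabs sums V2 within each id/tariff cell and returns a table
tab <- xtabs(V2 ~ id + V1, DF)

# as.data.frame.matrix keeps the wide layout
# (plain as.data.frame on a table would give a long format instead)
wide <- as.data.frame.matrix(tab)
wide$PRF
# [1] 2 0   <- the missing PRF for item 2 shows up as 0, not NA
```

Note that a missing tariff cannot be told apart from a genuine tariff of 0 in this output, which is one reason to prefer an approach that returns NA.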

"data.table"

Fast. dcast.data.table behaves predictably, following the behavior long established by dcast from "reshape2".

 library(data.table)
 dcast.data.table(
   as.data.table(DF)[, id := sequence(.N), by = V1][, id := cummax(id)],
   id ~ V1, value.var = "V2")
 #    id AHS BND MFN PRF
 # 1:  1   3 3.8   4   2
 # 2:  2   4 3.8   4  NA

Source: https://habr.com/ru/post/1203978/