I have several CSV files with columns such as gender, age, diagnosis, etc.
They are currently encoded as such:
ID, gender, age, diagnosis 1, male, 42, asthma 1, male, 42, anxiety 2, male, 19, asthma 3, female, 23, diabetes 4, female, 61, diabetes 4, female, 61, copd
The goal is to convert this data to this target format :
Sidenote: if possible, it would be great to also add the original column names to the new column names, for example. 'age_42' or 'gender_female.'
ID, male, female, 42, 19, 23, 61, asthma, anxiety, diabetes, copd 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0 2, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0 3, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0 4, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1
I tried using the reshape2 dcast() function, but I get combinations that lead to extremely sparse matrices. Here's a simplified example with only age and gender:
data.train <- dcast(data.raw, formula = id ~ gender + age, fun.aggregate = length) ID, male19, male23, male42, male61, female19, female23, female42, female61 1, 0, 0, 1, 0, 0, 0, 0, 0 2, 1, 0, 0, 0, 0, 0, 0, 0 3, 0, 0, 0, 0, 0, 1, 0, 0 4, 0, 0, 0, 0, 0, 0, 0, 1
Having seen that this is a fairly common task in preparing machine learning data, I believe that there may be other libraries (which I do not know) that can perform this conversion.