How do you anonymize a vector in a way that generates human-readable output in R?

Question

In order to protect research subjects from identification in data sets, I am interested in anonymizing vectors in R. However, I also want to be able to refer to the result when writing up a study (for example, "subject [random id] showed ..."). I found that I can use the anonymizer package to easily generate short hashes, but referring to short hashes in writing is not ideal: "subject f4d35fab showed ..." is hard to remember and a bit of a mouthful, and it is hard to tell apart from other hashed data (for example, "subject f4d35fab from 8b3bd334 showed ...").

Is there a way to convert hashes into random human-readable strings, or to anonymize data in a way that does not rely on cryptographic hashes?

r cryptography hash

joshisanonymous Mar 15 '18 at 19:12

3 answers

C. Braun · Answer 1 · 2018-03-15T19:29:28+0000

How about simply assigning a random number to each subject:

> subjects <- c("Matthew", "Mark", "Luke", "John")
> subjects.anon <- sample(length(subjects))
> subjects.anon
[1] 1 4 2 3

Then you can talk about subject 4 when discussing the data that pertain to Mark.

And if you want numbers that are not tied to the number of subjects:

sample(1000, length(subjects)) # [1] 789 103 435 983
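
To refer to subjects consistently, the mapping between real identifiers and random numbers has to be kept somewhere, separate from the published data. The following is a minimal sketch of that idea; the names `key` and `measurements` and the example scores are assumptions made for illustration, not part of the answer above.

```r
set.seed(42)  # only so the example is reproducible; omit for a real study

subjects <- c("Matthew", "Mark", "Luke", "John")

# Private key: one unique random number per subject (store this securely).
key <- data.frame(
  subject = subjects,
  anon_id = sample(1000, length(subjects)),
  stringsAsFactors = FALSE
)

# Hypothetical data set with repeated measurements per subject.
measurements <- data.frame(
  subject = c("Mark", "John", "Mark", "Luke"),
  score   = c(3.1, 4.7, 2.9, 5.0),
  stringsAsFactors = FALSE
)

# Replace the real identifier with its anonymous number.
measurements$anon_id <- key$anon_id[match(measurements$subject, key$subject)]
measurements$subject <- NULL

measurements
```

Because `sample()` draws without replacement by default, no two subjects share a number, and repeated rows for the same subject receive the same `anon_id`.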

MrFlick · Answer 2 · 2018-03-15T19:43:15+0000

Just use a reference list of human-readable names and map one to each unique value of the true identifier. Which list to use depends on how many values you need to create aliases for.

One such source is a list of baby names (here, the 1,000 most common names from 2010). For example:

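library(babynames)
library(dplyr)

# Example data: 50 subjects identified by a numeric id
samples <- data.frame(id=1:50, age=rnorm(50, 30, 5))

# Build a lookup table: one randomly drawn name per unique id
translate <- babynames %>% filter(year==2010) %>%
  top_n(1000, n) %>%
  distinct(name) %>%  # a name can be listed under both sexes; keep it once so aliases stay unique
  sample_n(length(unique(samples$id))) %>%
  select(alias_id=name) %>%
  bind_cols(id=unique(samples$id))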

translate
#     alias_id    id
#        <chr> <int>
#  1   Savanna     1
#  2    Jasmin     2
#  3   Natalie     3
#  4      Omar     4
#  5   Tristan     5
#  6  Jeremiah     6
#  7   Arielle     7
#  8    Tanner     8
#  9 Francesca     9
# 10     Devin    10
# # ... with 40 more rows
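
Once a translation table like this exists, the alias can be swapped in for the true identifier before the data are shared. The following is a minimal base R sketch; the small hand-written `translate` table stands in for the babynames-derived one, and the `samples` values and the name `samples_anon` are assumptions made for illustration.

```r
# Stand-in for the translation table (aliases are illustrative).
translate <- data.frame(
  id       = 1:3,
  alias_id = c("Savanna", "Jasmin", "Natalie"),
  stringsAsFactors = FALSE
)

# Hypothetical data keyed by the true identifier, with a repeated id.
samples <- data.frame(id = c(2, 1, 3, 2), age = c(31, 28, 35, 33))

# Look up each row's alias, then drop the true identifier.
samples$alias_id <- translate$alias_id[match(samples$id, translate$id)]
samples_anon <- samples[, c("alias_id", "age")]

samples_anon
```

The translation table itself should be stored apart from the anonymized data, since it is the only thing linking an alias back to a real subject.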

smci · Answer 3 · 2018-03-15T19:32:23+0000

Take the first m characters of each hash, provided the hashes are still unique in their first m characters. (The required m will tend to be O(log N), where N is the number of subjects.) Here is some example code:

set.seed(1)
# 100 random 8-letter strings standing in for hashes
v <- do.call(paste0, replicate(n=8, sample(LETTERS, size=100, replace=TRUE), simplify=FALSE))

# TRUE if every string in v is still unique after truncating to m characters
unique_in_first_m_chars <- function(v, m) {
  length(unique(substring(v, 1, m))) == length(v)
}

unique_in_first_m_chars(v, 4)
# [1] TRUE
unique_in_first_m_chars(v, 3)
# [1] FALSE
unique_in_first_m_chars(v, 2)
# [1] FALSE
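
Rather than testing values of m by hand, the smallest workable m can be searched for directly. The following is a minimal sketch; the helper name `min_unique_prefix` and the example hashes are made up for illustration.

```r
# Smallest m such that the first m characters of every string are distinct.
# Returns NA if the full strings themselves are not unique.
min_unique_prefix <- function(v) {
  for (m in seq_len(max(nchar(v)))) {
    if (!anyDuplicated(substring(v, 1, m))) return(m)
  }
  NA_integer_
}

# Made-up example hashes that share long prefixes.
hashes <- c("f4d35fab", "8b3bd334", "f4a91c20", "8b3bd9e1")

m <- min_unique_prefix(hashes)
short_ids <- substring(hashes, 1, m)

m
short_ids
```

Note that a prefix length chosen this way is only guaranteed for the current set of hashes; adding subjects later may require a longer prefix.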
