How to efficiently remove (or add) leading zeros in IP addresses in R?

Two data frames in R contain fields for IP addresses. In each data frame, these fields are "factors". The user intends to merge the two data frames on these IP addresses, as well as on several other fields. The problem is that the two data frames format IP addresses differently:

Dataframe A examples: 123.456.789.123, 123.012.001.123, 987.001.010.100 

The same IP addresses in Dataframe B will be formatted as:

 Dataframe B examples: 123.456.789.123, 123.12.1.123, 987.1.10.100 

What is the best (most efficient) way to either remove the leading zeros from A or add them to B, so that the addresses can be used in a merge? The operation will be performed on millions of records, so "most efficient" here mainly means computation time (it should be relatively fast).

2 answers

You can use sprintf to format the individual parts of the address. For example, for a given numeric value a you can zero-pad it to three digits as follows:

 b <- sprintf("%.3d", a) 

So, for the IP address, try this function:

 printPadded <- function(x){
   retStr = paste(sprintf("%.3d", unlist(lapply(strsplit(x, "\\.", perl = TRUE), as.numeric))), collapse = ".")
   return(retStr)
 }

Here are two examples:

 > printPadded("1.2.3.4")
 [1] "001.002.003.004"
 > lapply(c("1.2.3.4","5.67.100.9"), printPadded)
 [[1]]
 [1] "001.002.003.004"

 [[2]]
 [1] "005.067.100.009"
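Note that printPadded handles one address per call, which is why lapply is used above. As a minimal sketch (not part of the original answer), the same sprintf idea can be wrapped so that it accepts a whole character or factor column at once. The name padIP is hypothetical, and the sketch assumes every address is a well-formed dotted quad of non-negative integers.

```r
# Hypothetical helper: zero-pad every octet of each address to three digits.
# Assumes well-formed input such as "1.2.3.4"; no validation is done.
padIP <- function(ips) {
  # as.character() so that factor columns are handled by label, not by code
  parts <- strsplit(as.character(ips), ".", fixed = TRUE)
  vapply(
    parts,
    function(p) paste(sprintf("%03d", as.integer(p)), collapse = "."),
    character(1)
  )
}

padded <- padIP(c("1.2.3.4", "5.67.100.9"))
print(padded)
```

The result is a plain character vector of the same length as the input, so it can be assigned straight back to a data frame column before the merge.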

To go in the other direction, we can remove the leading zeros by applying gsub to the split values inside the printPadded function. For my money, I would recommend not removing the leading zeros. Neither removing nor padding them is strictly necessary, but fixed-width formats are easier to read and to sort (i.e., for sorting functions that are lexicographic).
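For the zero-removal direction, one possible sketch is a single vectorized gsub call over the whole column instead of splitting each address. The function name stripZeros and the exact regular expression are assumptions for illustration, not something given in the answer. The pattern drops zeros at the start of an octet only when another digit follows, so an octet that is entirely zero collapses to a single "0".

```r
# Hypothetical helper: strip leading zeros from each octet in one pass.
# "(^|\\.)" anchors at the start of an octet, "0+" is the run of zeros,
# and the lookahead "(?=\\d)" keeps the last digit of an all-zero octet.
stripZeros <- function(ips) {
  gsub("(^|\\.)0+(?=\\d)", "\\1", as.character(ips), perl = TRUE)
}

stripped <- stripZeros(c("123.012.001.123", "987.001.010.100"))
print(stripped)
```

Because gsub is vectorized, this avoids an explicit per-row loop, which may matter at the scale of millions of records.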


Update 1: just a hint about speed: if you are dealing with a large number of IP addresses and really want to speed things up, you can look at multi-core methods such as mclapply. The plyr package is also useful, with ddply() as one option; it also supports parallel backends via .parallel = TRUE. However, several million IP addresses should not take very long even on a single core.
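Since the question says the IP fields are factors, another possible speed-up (an assumption on my part, not something the answer proposes) is to reformat only the factor levels rather than every row: the number of distinct addresses is often far smaller than the number of records. The names ips and padOne below are hypothetical.

```r
# Toy factor column standing in for the real data; note the repeated address.
ips <- factor(c("1.2.3.4", "5.67.100.9", "1.2.3.4"))

# Hypothetical helper: zero-pad the octets of a single address string.
padOne <- function(x) {
  octets <- as.integer(strsplit(x, ".", fixed = TRUE)[[1]])
  paste(sprintf("%03d", octets), collapse = ".")
}

# Rewrite each distinct level once; every row using that level is updated.
levels(ips) <- vapply(levels(ips), padOne, character(1), USE.NAMES = FALSE)

print(as.character(ips))
```

If the two data frames end up with different level sets, converting both columns with as.character() before calling merge avoids surprises from mismatched factor levels.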


Another way (note that this snippet is Perl, not R):

 my @ipparts = split(/\./, $ip);
 for my $ii (0..$#ipparts) {
     # adding 0 forces numeric context, which drops the leading zeros
     $ipparts[$ii] = $ipparts[$ii] + 0;
 }
 $ip = join(".", @ipparts);

The more whole sections required by sprintf.


Source: https://habr.com/ru/post/1381807/

