Count the characters in a URL's host name

I have a df:

 dput(df)
 structure(list(URLs = c("http://bursesvp.ro//portal/user/_/Banco_Votorantim_Cartoes/0-7f2f5cb67f1-22918b.html",
 "http://46.165.216.78/.CartoesVotorantim/Usuarios/Cadastro/BV6102891782/",
 "http://www.chalcedonyhotel.com/images/promoc/premiado.tam.fidelidade/",
 "http://bmbt.ro/portal/a3/_Votorantim_/VotorantimCartoes2016/0-7f2f5cb67f1-22928b.html",
 "http://voeazul.nl/azul/")), .Names = "URLs", row.names = c(NA, -5L), class = "data.frame")

It contains several URLs, and I am trying to count the number of characters in the host name, whether that is an actual name (http://hostname.com/...) or an IP address (http://000.000.000.000/...). If it is an actual name, I only want the nchar of the part between www. and .com. If it is an IP address, I want all of its digits and the dots between them.

Expected result for the above sample data:

   exp_outcome
 1           8
 2          13
 3          15
 4           4
 5           7

I tried to do something with strsplit but could not get there.

3 answers

Another, possibly more direct way with another regex:

 nchar(sub("^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$", "\\2", df$URLs))
 #[1] 8 13 15 4 7

Explanation:

  • ^http:// : matches "http://" at the start of the string
  • (www\\.)? : matches "www.", zero or one time (so it is optional)
  • (([a-z]+)|([0-9.]+)) : the part to capture: either one or more lowercase letters, or one or more digits and dots
  • (\\.[a-z]+)? : matches "." followed by one or more lowercase letters, zero or one time (so it is optional)
  • /+.+$ : matches one or more "/" followed by one or more of any character, up to the end of the string
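
To see what each group in the list above captures, the same pattern can be run through regexec() and regmatches() from base R. This is only an illustrative sketch: the variable names (urls, pattern, groups, host) are made up here, and the two URLs are copied from the question.

```r
# Two sample URLs copied from the question: one named host, one IP address
urls <- c(
  "http://www.chalcedonyhotel.com/images/promoc/premiado.tam.fidelidade/",
  "http://46.165.216.78/.CartoesVotorantim/Usuarios/Cadastro/BV6102891782/"
)

# Same pattern as in the answer, with the character ranges written as [a-z]
pattern <- "^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$"

# For each URL, regmatches() returns the full match followed by the capture groups
groups <- regmatches(urls, regexec(pattern, urls))

# Element 1 is the full match, so capture group 2 (the "\\2" used in sub()) is element 3
host <- vapply(groups, function(g) g[3], character(1))

print(host)
print(nchar(host))
```

Printing groups itself shows which of the alternatives (letters, or digits and dots) matched for each URL.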

Note:

 sub("^http://(www\\.)?(([a-z]+)|([0-9.]+))(\\.[a-z]+)?/+.+$", "\\2", df$URLs)
 # [1] "bursesvp" "46.165.216.78" "chalcedonyhotel" "bmbt" "voeazul"

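Here's how to do it (assuming your data.frame is called x):

 # Extract everything between "http://" and the next "/"
 domains = sub('^(http://)([^/]+)(.*)$', '\\2', x$URLs)
 # Drop an optional "www." prefix and everything from the first remaining dot.
 # This will fail for IP addresses …
 hostname = sub('^(www\\.)?([^.]+)(\\..+)?$', '\\2', domains)
 # … which we treat separately here:
 is_ip = grepl('^(\\d{1,3}\\.){3}\\d{1,3}$', domains)
 hostname[is_ip] = domains[is_ip]
 # exp_outcome is assumed to be an existing data.frame with one row per URL
 exp_outcome$domain_length = nchar(hostname)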

As a side note, I converted the URLs in the original data.frame to character strings; it just doesn't make sense to use a factor for URLs.
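
A minimal sketch of that conversion, assuming the column was read in as a factor (the default for data.frame() before R 4.0). The data frame x is rebuilt here with two URLs from the question purely for illustration, and stringsAsFactors = TRUE is set explicitly to reproduce the old behaviour.

```r
# Illustrative data frame; stringsAsFactors = TRUE mimics the pre-R-4.0 default
x <- data.frame(
  URLs = c(
    "http://voeazul.nl/azul/",
    "http://46.165.216.78/.CartoesVotorantim/Usuarios/Cadastro/BV6102891782/"
  ),
  stringsAsFactors = TRUE
)

# The column starts out as a factor
was_factor <- is.factor(x$URLs)

# nchar() does not accept factors, so convert the column to character first
x$URLs <- as.character(x$URLs)

url_lengths <- nchar(x$URLs)
print(was_factor)
print(url_lengths)
```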


After 5 months of working with URLs in general, I found the following packages that make life easier (the regex provided by the other answers works fine, by the way):

 library(urltools)
 library(iptools)

 df$Hostname <- domain(df$URLs)

 # However, TLDs and 'www' need to go, so I used suffix_extract()$domain from `urltools`
 df$Hostname <- ifelse(is.na(suffix_extract(df$Hostname)$domain),
                       df$Hostname,
                       suffix_extract(df$Hostname)$domain)

 # which gives:
 #   URLs                                                  Hostname
 # 1 http://bursesvp.ro//portal/user/_/...                 bursesvp
 # 2 http://46.165.216.78/.CartoesVotorantim/Usuarios/...  46.165.216.78
 # 3 http://www.chalcedonyhotel.com/images/promoc/         chalcedonyhotel
 # 4 http://bmbt.ro/portal/a3/_Votorantim_/...             bmbt
 # 5 http://voeazul.nl/azul/                               voeazul

 # then simply,
 nchar(df$Hostname)
 #[1] 8 13 15 4 7

Source: https://habr.com/ru/post/1240548/

