How do I check whether a variable (for example, an abbreviation) matches a string from one list against another list when the original does not match?

I'm currently looking for a way in R to combine / merge two data frames. Alas, both data frames contain suboptimal data: they may contain abbreviations and even typos. Therefore, I would like to define a list of variants for each abbreviation and check whether a row contains one of these elements. If the original records do not match, R should check whether any of the other abbreviations match. To illustrate: a company name may end with "Limited", but also with "Ltd.", "Ltd", etc.

Example


Data

The original "Address" file contains:

Company name         Address 
Deloitte Ltd.        New York
Coca-Cola            New York
Tesla ltd            California
Microsoft Limited    Washington

It must be combined with "EnterpriseNrList":

Company name         EnterpriseNumber
Deloitte Ltd.        221
Coca-Cola            334
Tesla ltd            725
Microsoft Limited    127

, " ". , R , R . .

""

Limited.
limited 
Ltd.
ltd. 
Ltd
ltd


1) ?

2) ( 1, . ), containsx excel?

3) , , ( 2, . )?


1

, : , , , , -1, > 0, . "". 2.

, ( "" ).

2

1 . , , f.e. Coca-Cola .

Coca-Cola Limited
Coca-Cola Ltd. 
Coca-Cola Ltd
etc.

3

, / "". 2 , , 8000 .

+4
2

, .

The idea is to normalize the abbreviations so that, for example, every form of "Limited" becomes Limited.

abbrevs <- list('Limited'=c('Limited','Ltd'),'Incorporated'=c('Incorporated','Inc'))

Build a regular expression from each entry (for use with gsub or agrep):

regexes <- lapply(abbrevs,function(x) { paste0("(",paste0(x,collapse='|'),")[.]?") })

This gives:

$Limited
[1] "(Limited|Ltd)[.]?"

$Incorporated
[1] "(Incorporated|Inc)[.]?"

Now substitute in the Company.name column of each data frame:

for (i in seq_along(regexes)) { 
  Address$Company.name <- gsub(regexes[[i]], names(regexes[i]), Address$Company.name, ignore.case=TRUE)
  Enterprise$Company.name <- gsub(regexes[[i]], names(regexes[i]), Enterprise$Company.name, ignore.case=TRUE)
} 

This does not handle typos. For those you would have to work with agrep or adist.
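As a rough illustration of the typo case, the sketch below uses base R's adist to compute edit distances between two sets of names and keeps the closest candidate if it is within a threshold. The vectors `left` and `right`, the helper name `closest_match`, and the cutoff `max_dist = 2` are assumptions made for this example; they are not part of the original data.

```r
# Hypothetical inputs: 'left' contains a typo, 'right' is the reference spelling.
left  <- c("Deloite Limited", "Coca-Cola", "Tesla Limited")
right <- c("Deloitte Limited", "Coca-Cola", "Tesla Limited", "Microsoft Limited")

# For each name in x, return the closest name in candidates,
# or NA when even the closest one is more than max_dist edits away.
closest_match <- function(x, candidates, max_dist = 2) {
  d <- adist(x, candidates, ignore.case = TRUE)  # matrix: rows = x, cols = candidates
  best <- apply(d, 1, which.min)                 # index of the nearest candidate per row
  best_dist <- d[cbind(seq_along(x), best)]      # distance to that candidate
  ifelse(best_dist <= max_dist, candidates[best], NA_character_)
}

matched <- closest_match(left, right)
```

The threshold is a trade-off: a larger value catches more typos but also risks pairing two genuinely different companies, so the matches are worth reviewing by hand.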

Result for Address:

> Address
       Company.name    Address
1  Deloitte Limited   New York
2         Coca-Cola   New York
3     Tesla Limited California
4 Microsoft Limited Washington
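Once both name columns are normalized, the two data frames can be joined on that column. The following is a minimal sketch with base R's merge that rebuilds the example data so it runs on its own. The helper name `normalize_names`, the added `\\b` word boundaries (so that, for example, "Inc" is not replaced inside a longer word), and the use of `perl = TRUE` are assumptions of this sketch, not part of the answer above. The Enterprise names here are deliberately spelled differently from those in Address to show the effect.

```r
abbrevs <- list(Limited = c("Limited", "Ltd"),
                Incorporated = c("Incorporated", "Inc"))

# Replace every known variant (optionally followed by a dot) with its canonical form.
normalize_names <- function(x, abbrevs) {
  for (canonical in names(abbrevs)) {
    pattern <- paste0("\\b(", paste(abbrevs[[canonical]], collapse = "|"), ")\\b[.]?")
    x <- gsub(pattern, canonical, x, ignore.case = TRUE, perl = TRUE)
  }
  x
}

Address <- data.frame(
  Company.name = c("Deloitte Ltd.", "Coca-Cola", "Tesla ltd", "Microsoft Limited"),
  Address = c("New York", "New York", "California", "Washington"),
  stringsAsFactors = FALSE
)
# Hypothetical variant spellings, chosen so that a plain merge would miss three rows.
Enterprise <- data.frame(
  Company.name = c("Deloitte Limited", "Coca-Cola", "Tesla Ltd.", "Microsoft ltd"),
  EnterpriseNumber = c(221L, 334L, 725L, 127L),
  stringsAsFactors = FALSE
)

Address$Company.name <- normalize_names(Address$Company.name, abbrevs)
Enterprise$Company.name <- normalize_names(Enterprise$Company.name, abbrevs)

# all.x = TRUE keeps Address rows even when no enterprise number is found.
combined <- merge(Address, Enterprise, by = "Company.name", all.x = TRUE)
```

Rows that still have NA in EnterpriseNumber after the merge are the ones that need another pass, for example with a fuzzy comparison.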

Data used:

Address <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola", 
"Tesla ltd", "Microsoft Limited"), Address = c("New York", "New York", 
"California", "Washington")), .Names = c("Company.name", "Address"
), class = "data.frame", row.names = c(NA, -4L))

Enterprise <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola", 
"Tesla ltd", "Microsoft Limited"), EnterpriseNumber = c(221L, 
334L, 725L, 127L)), .Names = c("Company.name", "EnterpriseNumber"
), class = "data.frame", row.names = c(NA, -4L))
+1

, , .

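To match strings in R, you can use grep or grepl (grep returns the indices of the matching elements, grepl returns a logical vector). With ignore.case = TRUE, upper/lower case is ignored.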
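A short sketch of the difference between the two functions, using a hypothetical vector `company_names`:

```r
company_names <- c("Deloitte Ltd.", "Coca-Cola", "Tesla ltd", "Microsoft Limited")

# grep returns the positions of the matching elements.
idx <- grep("ltd", company_names, ignore.case = TRUE)

# grepl returns one TRUE/FALSE value per element.
has_ltd <- grepl("ltd", company_names, ignore.case = TRUE)
```

Note that "Microsoft Limited" does not match here, because the letters "ltd" never appear consecutively in "Limited"; each variant has to be listed in the pattern explicitly.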

To split the company names into their individual words, you can use: unlist(strsplit(CompanyNames, split = " "))
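For instance, strsplit can be used to pull out the last word of each name, which is where a suffix such as "Ltd." would sit. The vector `CompanyNames` below is sample data, and `last_word` is a name introduced only for this sketch.

```r
CompanyNames <- c("Deloitte Ltd.", "Coca-Cola", "Tesla ltd", "Microsoft Limited")

# strsplit returns a list with one character vector of words per name.
words <- strsplit(CompanyNames, split = " ")

# Take the last word of each name.
last_word <- vapply(words, function(w) w[length(w)], character(1))
```

Keeping the list returned by strsplit (instead of calling unlist on it) preserves which words belong to which company.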

, .

, !

0

Source: https://habr.com/ru/post/1623947/

