It is useful to know the exceptions (for example, a city called Mary Sue), but end users should be happy if your software can handle the most likely cases. Names can be ranked by their relative frequency of occurrence in each category: personal name, company name, city name. For companies, the number of employees can be used to calculate relative probability. For cities, use population.
Do you already have rules for checking the relative position of the string containing each token?
Of course, there are quite a few business card formats, but if you have several hundred typical business cards, you should be able to identify some common formatting rules. Having just a few rules might help a lot. One rule might be "80% of all cards have an address below the personal name and company name." Your sample of business cards cannot really be representative of all possible business cards, all languages, and so on, but it is a beginning. Even a few 50% and 80% rules can simplify your task.
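As a rough sketch of how such a percentage could be measured, the snippet below counts how often one rule holds across a hand-labelled sample. The card representation (a dict mapping a label to its line index, with 0 as the top line), the label names, and the four sample cards are all invented for illustration.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical representation: label -> line index on the card (0 = top line).
Card = Dict[str, int]

def rule_support(cards: List[Card], rule: Callable[[Card], Optional[bool]]) -> float:
    """Fraction of the cards the rule applies to for which the rule holds."""
    results = [rule(card) for card in cards]
    applicable = [r for r in results if r is not None]
    return sum(applicable) / len(applicable) if applicable else 0.0

def address_below_names(card: Card) -> Optional[bool]:
    """True if the address sits below both the personal and the company name."""
    if not all(key in card for key in ("person", "company", "address")):
        return None  # rule cannot be evaluated on this card
    return card["address"] > max(card["person"], card["company"])

# Made-up sample; a real one would come from several hundred labelled cards.
sample = [
    {"person": 0, "title": 1, "company": 2, "address": 3},
    {"person": 0, "company": 1, "title": 2, "address": 3},
    {"address": 0, "person": 2, "company": 3},
    {"person": 0, "title": 1},  # no address, so the rule is skipped
]
support = rule_support(sample, address_below_names)
```

Cards that lack one of the fields are excluded from the denominator, so the figure describes only the cards where the rule can be checked.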
Perhaps you can come up with a few rules by starting from a contrived example.
John Smith
Chief Operating Officer
Acme Inc.
123 Main Street
Somewhere, XZ 01010
is more likely than
Somewhere, XZ
01010
John Smith
Acme Inc.
Chief Operating Officer
123 Main Street
This suggests that we can consider the relative Y-position of personal and company names with respect to postal codes. Although a personal name, job title, and company name can follow in any of several orders, zip codes are likely to be located below company names. Zip codes will be closer to city names, etc.
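One way to turn these positional observations into something computable is to score a candidate labelling by multiplying a weight for each rule. This is only a sketch: the label names, the two rules chosen, and the 0.9 / 0.1 weights are assumptions, not measured values.

```python
from typing import Dict

# Hypothetical weights: how much a satisfied or violated rule counts.
HOLDS, VIOLATED = 0.9, 0.1

def layout_score(lines: Dict[str, int]) -> float:
    """Score a label -> line-index assignment (0 = top line of the card)."""
    score = 1.0
    company = lines.get("company")
    postal = lines.get("postal_code")
    city = lines.get("city")
    # Rule 1: the postal code is likely to be below the company name.
    if company is not None and postal is not None:
        score *= HOLDS if postal > company else VIOLATED
    # Rule 2: the postal code is likely to be on or next to the city line.
    if city is not None and postal is not None:
        score *= HOLDS if abs(postal - city) <= 1 else VIOLATED
    return score

# The two layouts from the example above, as line indices.
likely = {"person": 0, "title": 1, "company": 2, "street": 3,
          "city": 4, "postal_code": 4}
unlikely = {"city": 0, "postal_code": 1, "person": 2, "company": 3,
            "title": 4, "street": 5}

likely_score = layout_score(likely)
unlikely_score = layout_score(unlikely)
```

With these made-up weights the first layout scores higher than the second, because the second places the postal code above the company name.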
Although a word like Samantha may be part of a personal name, street name, or company name, it is most likely the name of the person. You should be able to find databases that indicate the relative frequency of birth names, the population of cities with the name Samantha, and the number of registered corporations with the name Samantha. Even partial databases will be useful in establishing some reasonable probability factors.
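A minimal sketch of how such frequency data could become probability factors: normalize the per-category counts for a token so they sum to 1. The counts below are invented, and treating births, city population, and corporation counts as directly comparable is itself an assumption that you would want to tune.

```python
from typing import Dict

def category_probabilities(counts: Dict[str, float]) -> Dict[str, float]:
    """Normalize raw per-category counts into relative probabilities."""
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: value / total for category, value in counts.items()}

# Made-up figures for the token "Samantha"; real ones would come from
# birth-name, city-population, and company-registry databases.
samantha_counts = {"person": 500_000, "city": 2_000, "company": 300}
samantha_probs = category_probabilities(samantha_counts)
best_guess = max(samantha_probs, key=samantha_probs.get)
```

Even with partial data, the ordering of the categories is often more useful than the exact values.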
Other possible rules:
- A run of 5-7 letters and digits at the end of a line (for left-to-right text) or on its own line is likely to be a postal code.
- "Inc", "Ltd", "Corp" and similar abbreviations should increase the likelihood that the string is identified as a company name.
- A personal name is likely to be located above the title. (Maybe 85% - 95% of the time?)
- Phone numbers follow a somewhat limited number of patterns and typically include characters not found in the postal codes: "(" ")" "."
- Websites follow common patterns. Even if there is someone whose legal name is “CarolGreen.com”, she probably won’t be surprised if her name is recognized as a website.
- The @ symbol is almost certainly part of an email address. The email address is probably located on some line below the person’s name, assuming an email address is shown at all.
- Some information may be missing. A website may not be shown on the card. There may be a phone number but no street address. A person may not have a title. A personal business card may not have a company name. Most likely, at least one line will be a personal name.
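Several of the pattern rules above can be sketched with regular expressions. The patterns below are deliberately simple assumptions: the phone pattern only covers North American style numbers, the website pattern only knows a few top-level domains, and the postal-code pattern requires at least one digit so that a trailing word such as "Street" is not mistaken for a code. The function name `guess_label` is hypothetical.

```python
import re
from typing import Optional

EMAIL = re.compile(r"\S+@\S+\.\S+")
WEBSITE = re.compile(r"(?:https?://|www\.)\S+|\b[\w-]+\.(?:com|org|net)\b",
                     re.IGNORECASE)
# Assumed North American shape, e.g. "(555) 123-4567" or "555.123.4567".
PHONE = re.compile(r"\(?\d{3}\)?[\s.\-]?\d{3}[\s.\-]\d{4}")
COMPANY_SUFFIX = re.compile(r"\b(?:Inc|Ltd|Corp)\b", re.IGNORECASE)
# 5-7 letters/digits, at least one digit, at the end of the line or alone.
POSTAL_CODE = re.compile(r"(?:^|\s)(?=[A-Za-z0-9]*\d)[A-Za-z0-9]{5,7}$")

def guess_label(line: str) -> Optional[str]:
    """Return the first pattern-based label that matches, or None."""
    text = line.strip()
    checks = [  # order matters: an email also contains a website-like part
        ("email", EMAIL),
        ("website", WEBSITE),
        ("phone", PHONE),
        ("company", COMPANY_SUFFIX),
        ("postal_code", POSTAL_CODE),
    ]
    for label, pattern in checks:
        if pattern.search(text):
            return label
    return None  # missing or unrecognized information is expected

labels = [guess_label(s) for s in (
    "john.smith@acme.com", "CarolGreen.com", "(555) 123-4567",
    "Acme Inc.", "Somewhere, XZ 01010", "123 Main Street",
)]
```

A line that matches nothing returns `None`, which leaves it free to be classified by the positional and frequency rules instead.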