Best approach to finding first and last names from blob text

Question

I am working on a program that does OCR on a US business card, and is trying to return information such as first name, last name, etc. The challenge is how to do this.

So far, I have created the following data files:

first_names.txt (contains 23k+ first names)
last_names.txt (contains 86k+ last names)
job_title.txt (contains 500+ job titles)
us_cities.txt (contains 10k+ US cities)
states_full.txt (contains full names of all US states)
states_abv.txt (contains all US state abbreviations)

The goal was to tokenize the OCR data on spaces and try to assign a “weight” to each line, based on the likelihood that it is a particular data type.

For example, a line that appears early in the text blob will most likely be a name, company, or title. Similarly, if a string is found in first_names.txt or last_names.txt, it gets more weight toward being a first or last name.

This approach sounds fine in theory, but I wonder how best to approach it in terms of programming (in PHP, not that the language matters). The hard part is that the weight of some tokens depends on other tokens. For instance:

If a token appears to be a first name, it is likely that the next token is a last name.
Some tokens belong together, but if everything is split on spaces, I'm not sure how to reconnect them. For example, Anne Marie, FL will be counted as three tokens: Anne, Marie, and FL. Even worse, Anne and Marie would both gain weight as names. If weight is also awarded based on position, an earlier line with first-name weight could win the name slot, freeing this line to be identified as a city.

I know there are a lot of smart people here, so maybe someone has an idea about this!

algorithm php tokenize logic tagging

Anthony Nov 20 '11 at 3:09

1 answer

Ethunk · Accepted Answer · 2011-12-20T06:46:48+0000

It is useful to know the exceptions (for example, a city called Mary Sue), but end users should be happy if your software handles the most likely cases. Names can be ranked by their relative frequency of occurrence in each category: personal name, company name, city name. For companies, the number of employees can be used to estimate relative probability; for cities, the population.

Do you already have rules for checking the relative position of the line containing each token?

Of course, there are quite a few business card formats, but if you have several hundred typical business cards, you should be able to identify some common formatting rules. Having just a few rules might help a lot. One rule might be "80% of all cards have the address below the personal name and company name." Your sample of business cards cannot really be representative of all possible business cards, all languages, and so on, but it is a start. Even a few 50% and 80% rules can simplify your task.

Perhaps you can come up with a few rules by looking at an absurd example.

  John Smith
  Chief Operating Officer
  Acme Inc.
  123 Main Street
  Somewhere, XZ 01010

is more likely than

  Somewhere, XZ
  01010
  John Smith
  Acme Inc.
  Chief Operating Officer
  123 Main Street

This suggests that we can consider the relative Y-position of personal and company names with respect to postal codes. Although a personal name, job title, and company name can follow in any of several orders, zip codes are likely to be located below company names. Zip codes will be closer to city names, etc.
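A minimal sketch of how such relative-position rules could be scored, written in Python for brevity (the same logic ports to PHP). The label names, the ABOVE_PROB table, and every probability in it are hypothetical placeholders, not measured values; real numbers would have to be estimated from a sample of cards.

```python
# Hypothetical pairwise layout rules: (upper, lower) -> assumed probability
# that a line of type `upper` appears above a line of type `lower`.
# These numbers are placeholders, not measurements.
ABOVE_PROB = {
    ("person", "postal_code"): 0.9,
    ("company", "postal_code"): 0.9,
    ("person", "title"): 0.9,
    ("street", "postal_code"): 0.95,
}


def layout_score(labels):
    """Score a top-to-bottom list of guessed line types.

    Every pair of lines covered by a rule contributes its probability
    when the order matches, or (1 - probability) when it is reversed.
    Pairs without a rule are ignored.
    """
    score = 1.0
    for i, upper in enumerate(labels):
        for lower in labels[i + 1:]:
            if (upper, lower) in ABOVE_PROB:
                score *= ABOVE_PROB[(upper, lower)]
            elif (lower, upper) in ABOVE_PROB:
                # The pair appears in the less likely order.
                score *= 1.0 - ABOVE_PROB[(lower, upper)]
    return score


# The two example cards above, reduced to one label per line.
likely = ["person", "title", "company", "street", "postal_code"]
unlikely = ["city", "postal_code", "person", "company", "title", "street"]

print(layout_score(likely) > layout_score(unlikely))
```

With a score like this, competing labelings of the same card can be compared and the highest-scoring one kept, which is one way to let the weight of one line depend on the others.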

Although a word like Samantha may be part of a personal name, street name, or company name, it is most likely the name of the person. You should be able to find databases that indicate the relative frequency of birth names, the population of cities with the name Samantha, and the number of registered corporations with the name Samantha. Even partial databases will be useful in establishing some reasonable probability factors.
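As a rough illustration of turning such frequencies into probability factors, here is a small Python sketch. The COUNTS table and all of its numbers are invented for the example; real values would come from whatever name, city, and company data you can find, and would need to be put on a comparable scale before being mixed like this.

```python
# Hypothetical counts of how often a word occurs in each category.
# The numbers are invented for illustration only.
COUNTS = {
    "samantha": {"person": 5000, "city": 20, "company": 80},
    "mary": {"person": 9000, "city": 300, "company": 200},
}


def category_probabilities(token):
    """Return each category's share of the token's total count.

    Unknown tokens yield an empty dict, meaning "no evidence".
    """
    counts = COUNTS.get(token.lower())
    if not counts:
        return {}
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


probs = category_probabilities("Samantha")
best = max(probs, key=probs.get)
print(best, probs)
```

Even a partial table is usable here, because a token that is missing from it simply contributes no evidence either way.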

Other possible rules:

A mix of letters and digits, 5-7 characters long, at the end of a line (for left-to-right text) or on its own line is likely to be a postal code.
"Inc", "Ltd", "Corp" and similar abbreviations should increase the likelihood that the line is identified as the company name.
A personal name is likely to be located above the title. (Maybe 85% - 95% of the time?)
Phone numbers follow a fairly limited number of patterns and typically include characters not found in postal codes: "(" ")" "."
Websites follow common patterns. Even if there is someone whose legal name is “CarolGreen.com”, she probably won’t be surprised if her name is recognized as a website.
The @ symbol is almost certainly part of an email address. The email address is probably on some line below the person’s name, assuming the email address is shown at all.
Some information may be missing. A website may not appear on the card. There may be a phone number but no street address. A person may not have a title. A personal business card may not have a company name. Most likely, at least one line will be a personal name.
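Several of the rules above can be sketched as pattern checks applied to a whole line before it is split into tokens. This is only a sketch in Python (the regular expressions carry over to PHP's preg functions with minor changes); the patterns, labels, and weights are illustrative guesses, and the postal code pattern assumes US ZIP codes only.

```python
import re

# Each rule is (label, pattern, weight). All patterns and weights are
# assumptions made for this example, not tuned values.
RULES = [
    ("email", re.compile(r"\S+@\S+\.\S+"), 5.0),
    ("website", re.compile(r"\b(?:www\.)?[\w-]+\.(?:com|net|org)\b", re.I), 3.0),
    ("company", re.compile(r"\b(?:Inc|Ltd|Corp|LLC)\b\.?", re.I), 3.0),
    ("phone", re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]\d{4}"), 3.0),
    # US ZIP or ZIP+4 at the end of the line (assumption: US cards only).
    ("postal_code", re.compile(r"\b\d{5}(?:-\d{4})?\s*$"), 3.0),
]


def score_line(line):
    """Return {label: weight} for every rule whose pattern matches the line."""
    scores = {}
    for label, pattern, weight in RULES:
        if pattern.search(line):
            scores[label] = scores.get(label, 0.0) + weight
    return scores


def best_label(line):
    """Return the highest-weighted label, or None if no rule matched."""
    scores = score_line(line)
    return max(scores, key=scores.get) if scores else None


for text in ["Acme Inc.", "Somewhere, XZ 01010", "(555) 123-4567",
             "carol@carolgreen.com", "www.carolgreen.com", "John Smith"]:
    print(text, "->", best_label(text))
```

An email address also matches the website pattern, so the email rule is given the larger weight. Lines that match nothing, such as a personal name, come back as None and are left for the dictionary and position rules to decide.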
