Smart single-entry search

Question


I was browsing a social network and found that it lets you search for a person by name, age, city, country and gender. Interestingly, all of this information can be typed into a single text field, separated by spaces. The search engine then somehow analyzes it very accurately and returns a list of results.

At first glance it looks pretty simple: split the query on spaces and look up each part in all the relevant tables. So far, so good, but:

There are cities whose names have two or more words, and the user can enter them in different ways, since this is free text.
Names of people can also consist of more than two words.

Question:

How can we split the query so that we know which part of it should be searched where, that is, the name in the users table, the city in the cities table, the country in the countries table, and so on?

What I have done so far:

Populate the data source with all users.
Check whether a country from the countries table appears in the query.
If it does, filter the data source to keep only users from this country.
Check whether a city from the cities table appears in the query.
If it does, filter the data source to keep only users from this city.

And so on for each table. Each time we find a match in a table, we remove the matched part from the query, which eventually leaves us with the freest parameter: the name.
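The flow described above could be sketched roughly as follows. This is only an illustration in Python: the `COUNTRIES` and `CITIES` sets are hypothetical stand-ins for the real tables, `extract_known` and `parse_query` are made-up helper names, and age and gender are left out for brevity.

```python
# Hypothetical lookup data; in practice these would come from the
# countries and cities tables.
COUNTRIES = {"usa", "israel", "united kingdom"}
CITIES = {"new york city", "tel aviv", "london"}


def extract_known(tokens, vocabulary, max_words=4):
    """Find the longest run of adjacent tokens that is in `vocabulary`.

    Returns (match, remaining_tokens); match is None when nothing is found.
    """
    for size in range(min(max_words, len(tokens)), 0, -1):  # longest first
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            if candidate in vocabulary:
                # Remove the matched words from the query.
                return candidate, tokens[:start] + tokens[start + size:]
    return None, tokens


def parse_query(query):
    """Strip known countries and cities; whatever is left is the name."""
    tokens = query.lower().split()
    country, tokens = extract_known(tokens, COUNTRIES)
    city, tokens = extract_known(tokens, CITIES)
    return {"country": country, "city": city, "name": " ".join(tokens) or None}


print(parse_query("Jane Doe New York City USA"))
```

Trying the longest word groups first is one way to handle multi-word cities; it still relies on exact spelling, which is the limitation discussed next.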

This seems to work if the user knows exactly how cities, countries, etc. are spelled in my database, but in reality the user may enter only part of a city name or misspell it.
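One possible way to tolerate partial or misspelled city names is approximate matching. The sketch below uses Python's standard `difflib` module; the `CITIES` list, the `match_city` helper and the `0.75` cutoff are assumptions made for illustration, not values taken from any real system.

```python
import difflib

# Hypothetical city list; in practice this would be loaded from the cities table.
CITIES = ["new york city", "tel aviv", "london", "los angeles"]


def match_city(text, cutoff=0.75):
    """Return the single best city for `text`, or None if unsure."""
    text = text.lower().strip()
    if not text:
        return None
    # Partial input: accept it only when exactly one city contains it.
    partial = [city for city in CITIES if text in city]
    if len(partial) == 1:
        return partial[0]
    # Misspelled input: fall back to similarity matching.
    close = difflib.get_close_matches(text, CITIES, n=1, cutoff=cutoff)
    return close[0] if close else None


print(match_city("New Yrok City"))
print(match_city("york"))
print(match_city("lo"))
```

Returning `None` for ambiguous input such as `lo` keeps the search from guessing; a real system might instead show several candidates to the user.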

I do not know whether I am heading in the right direction with what I have done. This is just a starting point...

PS: I just need the flow of the algorithm, so the programming language does not really matter. Any idea or guidance is more than welcome.

Thanks

+4

c# sql php

jekcom Dec 24 '11 at 20:57


2 answers

I have no experience here, but I think this is a natural language processing problem.

Part of this kind of processing is accepting that you will not always be right. It follows that your goal is to identify the cases where you can be confident in the assumptions you make.

For instance,

If a user searched for jane doe in New York City, they would not type jane new york city doe; the words of the name and the words of the city would each form an adjacent group. You do not know the length of each group, but there is only a finite number of combinations to try. Given jane doe new york city, you can iterate over the combinations of adjacent groups.

 scoreAsName('jane')
 scoreAsName('jane doe')
 scoreAsName('jane doe new')

... and so on ... and do the same for scoreAsCity.

Some combinations should clearly win for both. The best choice is probably the name and city combination that gives the highest total score. You will need to design a scoring algorithm, probably based largely on matches against the database, but it can also use auxiliary input, for example boosting the score of local names.
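A minimal sketch of this idea in Python follows. The scoring functions here are deliberately naive and hypothetical: `KNOWN_NAMES` and `KNOWN_CITIES` stand in for database lookups, and `score_as_name`, `score_as_city` and `best_split` are illustrative names, not part of any library.

```python
# Hypothetical reference data standing in for database lookups.
KNOWN_NAMES = {"jane", "doe", "john", "smith"}
KNOWN_CITIES = {"new york city", "york", "london"}


def score_as_name(words):
    """Fraction of words that are known name parts (0.0 for an empty group)."""
    if not words:
        return 0.0
    return sum(w in KNOWN_NAMES for w in words) / len(words)


def score_as_city(words):
    """1.0 for an exact city match, otherwise 0.0."""
    return 1.0 if " ".join(words) in KNOWN_CITIES else 0.0


def best_split(query):
    """Try every split into two adjacent groups and keep the best total."""
    tokens = query.lower().split()
    candidates = []
    for i in range(len(tokens) + 1):
        left, right = tokens[:i], tokens[i:]
        # The name may come before or after the city.
        candidates.append((score_as_name(left) + score_as_city(right), left, right))
        candidates.append((score_as_name(right) + score_as_city(left), right, left))
    score, name, city = max(candidates, key=lambda c: c[0])
    return {"name": " ".join(name), "city": " ".join(city), "score": score}


print(best_split("jane doe new york city"))
print(best_split("new york city jane doe"))
```

With more fields (country, age, gender) the same approach extends to splitting the query into more adjacent groups, at the cost of more combinations to score.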

Very interesting topic.

0

goat Dec 24 '11 at 10:55


Lb · Accepted Answer · 2011-12-24T21:44:49+0000

Queries like these are not a good fit for relational databases. If it is an option, you might consider using Lucene.Net (C#) or Lucene (Java).
