String Search Algorithms


I am trying to extract contact information from the content pages of a set of websites (thousands of them). I wanted to ask experts like you before scratching my head over it. All I need is addresses, email addresses, phone numbers, and similar contact details, where available.

I think you already see the problem. Yes, it is formatting ... since there is no standard format that sites follow, it is really difficult to pin down the exact information I need. Some websites are designed with Flash contact pages, and others render contact information as images with custom fonts.

Any tips / ideas / suggestions are most welcome ...

Thanks....

+6
3 answers

This is, as you would expect, by no means a trivial task. Here is one way to approach it:

  • Use an inverted-index system such as Lucene/Solr or Sphinx to index the pages. You may need to write your own crawler/spider, although Apache Nutch and other crawlers work out of the box. If the content is fairly static, download it locally.

  • After indexing the content, you can query it for email addresses, phone numbers, etc. by building a boolean query, for example:

        Content:@ AND (Content:.com OR Content:.net)    // for emails
        OR Content:"(" OR Content:")"                   // for parentheses in phone numbers

    Important: the above should not be taken literally. You can get even fancier using Lucene's RegexpQuery and SpanQuery, which let you build fairly complex queries (a rough sketch follows after this list).

  • Finally, on the result pages, (a) run a result highlighter to get the snippet around the query match, and (b) on those snippets, run regexes to extract the fields of interest (the second sketch below shows this step).

  • If your dataset is North American, you can run additional passes to verify the addresses, using i) a mapping provider such as Bing Maps or Google Maps to validate street addresses (as far as I know, USPS and others offer address validation for a fee, for checking US ZIP codes and Canadian postal codes), or ii) reverse DNS lookups on the email addresses' domains, etc.
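
To make the query step concrete, here is a minimal, self-contained sketch in Java against a recent Lucene version (class names shift a bit between Lucene releases, so treat it as illustrative rather than definitive). It indexes one page in memory using a whitespace analyzer, so a token like "info@example.com" survives intact, and then finds it with a term-level RegexpQuery:

    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.RegexpQuery;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class ContactQueryDemo {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory(); // in-memory index
            // A whitespace analyzer keeps "info@example.com" as one token,
            // so a term-level regex can still see the '@' and the TLD.
            try (IndexWriter writer =
                     new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()))) {
                Document page = new Document();
                page.add(new TextField("content",
                    "Contact us: info@example.com or call (555) 123-4567",
                    Field.Store.YES));
                writer.addDocument(page);
            }

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // Matches any single token shaped like user@host.com or user@host.net
                RegexpQuery emailQuery =
                    new RegexpQuery(new Term("content", ".+@.+\\.(com|net)"));
                for (ScoreDoc hit : searcher.search(emailQuery, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("content"));
                }
            }
        }
    }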
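
And for step (b), once the highlighter has produced snippets, plain java.util.regex patterns can pull the fields out. The patterns below are deliberately loose placeholders, not production-grade email/phone validators; tune them against your own data:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SnippetExtractor {
        // Loose placeholder patterns; real-world formats are messier.
        private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
        private static final Pattern NA_PHONE =
            Pattern.compile("\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}");

        public static void main(String[] args) {
            String snippet = "Reach us at info@example.com or (555) 123-4567.";
            printMatches(EMAIL, snippet);    // prints: info@example.com
            printMatches(NA_PHONE, snippet); // prints: (555) 123-4567
        }

        private static void printMatches(Pattern pattern, String text) {
            Matcher m = pattern.matcher(text);
            while (m.find()) {
                System.out.println(m.group());
            }
        }
    }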

This should get you started ... as I said, there is no single best solution; you will need to iterate over several approaches to reach the desired level of accuracy.

Hope this helps.

+10

Conditional random fields have been used for precisely this kind of task, quite successfully. You can use CRF++ or the Stanford Named Entity Recognizer. Both can be invoked from the command line without writing any code.

In short, you first train these algorithms by feeding them a few examples of names, email addresses, etc. taken from web pages, so that they learn to recognize those items. Once the models have learned from the examples you gave them, you can run them on your data and see what you get.
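
For example, running the Stanford NER with one of its bundled pretrained models takes only a few lines of Java. The model path below is the one shipped in the standard Stanford NER distribution; a model trained on your own labeled contact pages would plug in the same way:

    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;

    public class NerDemo {
        public static void main(String[] args) throws Exception {
            // Pretrained 3-class model (PERSON / LOCATION / ORGANIZATION)
            // from the Stanford NER download.
            CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(
                "classifiers/english.all.3class.distsim.crf.ser.gz");

            String text = "Contact John Smith at Acme Corp, 221B Baker Street, London.";
            // Wraps each recognized entity in an XML-style tag, e.g.
            // <PERSON>John Smith</PERSON> at <ORGANIZATION>Acme Corp</ORGANIZATION> ...
            System.out.println(classifier.classifyWithInlineXML(text));
        }
    }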

Don't be put off by the Wikipedia page. The packages come with plenty of examples, and you should be up and running within a few hours.

+3

@Mikos is right, you will definitely need more than one approach. Another possible tool to consider is Web-Harvest. It is a web data extraction tool that lets you crawl websites and extract the data you are interested in, all driven by XML configuration files. The software has both a graphical interface and a command-line interface.

It lets you use text/XML manipulation techniques such as XSLT, XQuery, and regular expressions, and you can also write your own plugins. However, it is mainly focused on HTML/XML-based websites.
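
If you prefer to drive it from code rather than the GUI or the command line, Web-Harvest can also be embedded in a Java program. The sketch below follows the embedding example from the project's documentation, but I am quoting the class names from memory, so verify them against the version you download; "contacts.xml" is a hypothetical config file:

    import org.webharvest.definition.ScraperConfiguration;
    import org.webharvest.runtime.Scraper;

    public class HarvestRunner {
        public static void main(String[] args) throws Exception {
            // "contacts.xml" is a hypothetical Web-Harvest config that would
            // fetch a contact page and extract, say, mailto: links via XPath.
            ScraperConfiguration config = new ScraperConfiguration("contacts.xml");
            Scraper scraper = new Scraper(config, "work"); // "work" = working dir
            scraper.execute();
        }
    }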

+1
