Best way to scrape a set of mixed-content pages

I'm trying to show a list of dining places near the office together with today's menus. The problem is that the websites publishing lunch menus don't all offer their content in the same form.

For example, some websites offer nice JSON output. Look at this one: it provides English and Finnish names separately, and everything I need is available. There are a couple of sites like this.

But others are not nearly as convenient. Like this one: the content is plain HTML, the English and Finnish dish names are not consistently ordered, and the dietary codes (L, VL, VS, G, etc.) are plain text mixed in with the product names.

What, in your opinion, is the best way to scrape all this data in its different formats and turn it into something usable? I tried writing a scraper with Node.js (with PhantomJS, etc.), but it only works for one website, and it's not very accurate with the product names.

Thanks in advance.

3 answers

You could use something like kimonolabs.com; it is much easier to use, and it gives you an API you can poll from your side. Keep in mind that it is best suited for tabular content.


There is no magic parser that handles every site; you will end up writing a scraper per source (each page is structured differently).

For telling dish names apart from the dietary codes, something like TF/IDF could help: tokens such as L, VL or G occur on almost every menu line, while the words of an actual dish name do not, so frequency statistics can separate the two.

Whatever tools you pick, the workflow is roughly the same:

  • fetch the page,
  • extract the interesting fragments,
  • normalize them into your own data model.

In PHP, Simple HTML Dom Parser and Guzzle are a good combination: the parser gives you jQuery-like selectors over the HTML, and Guzzle is a convenient wrapper around HTTP.
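The TF/IDF idea can be sketched in a few lines. This is an illustrative assumption, not code from the answer: it learns which tokens behave like dietary flags by their document frequency across menu lines (function names and the threshold are invented for the example; very rare codes would need more data or a seed dictionary).

```typescript
// Split a raw menu line into tokens, dropping parentheses and commas.
function tokenize(line: string): string[] {
  return line.replace(/[(),]/g, " ").trim().split(/\s+/).filter(t => t.length > 0);
}

// Learn which tokens behave like flags: dietary codes such as "L" or "G"
// appear on a large fraction of lines, real dish words do not.
function learnFlagTokens(lines: string[], minDocFreq = 0.5): Set<string> {
  const df = new Map<string, number>();
  for (const line of lines) {
    for (const tok of new Set(tokenize(line))) {
      df.set(tok, (df.get(tok) ?? 0) + 1);
    }
  }
  const flags = new Set<string>();
  for (const [tok, count] of df) {
    if (count / lines.length >= minDocFreq) flags.add(tok);
  }
  return flags;
}

// Split one line into a dish name and its dietary codes using the learned set.
function splitLine(line: string, flags: Set<string>): { name: string; codes: string[] } {
  const tokens = tokenize(line);
  return {
    name: tokens.filter(t => !flags.has(t)).join(" "),
    codes: tokens.filter(t => flags.has(t)),
  };
}
```

For example, trained on a day's worth of lines, `splitLine("Chicken soup L G", flags)` would yield the name `"Chicken soup"` and the codes `["L", "G"]`.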


There is no single tool that will do this for you; treat it as a data-integration (ETL) problem.

The pipeline looks roughly like this:

  • fetch each page (HTTP),
  • extract the raw data (JSON or HTML parsing),
  • transform it into a common model (names, languages, dietary codes),
  • load/serve the normalized result.

Steps 1-2 are inevitably site-specific, while steps 3-4 can be shared across all sources.

For steps 1-2 you write a small adapter per site: one source gives you clean JSON, another gives you plain HTML, but each adapter's only job is to emit the same common model. A new site then only costs you one more adapter.

Do not expect the name handling to be 100% accurate: free-text dish names with inline dietary codes will need heuristics (dictionaries of the known codes, fuzzy matching) and occasionally manual correction.

If you are on the JVM, Java or Groovy with an integration framework (Mule ESB/Spring Integration) gives you this kind of pipeline structure out of the box.

Bottom line: treat it as data integration rather than one clever scraper; you will not reach 100% quality automatically (somebody has to review the edge cases).
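The per-site adapter idea can be sketched as follows. Everything here is a hypothetical illustration: the common `Dish` model, the two site shapes, and all field names are invented, not taken from any real menu API.

```typescript
// The common model every adapter must produce (steps 3-4 work only on this).
interface Dish {
  nameEn: string;
  nameFi: string;
  codes: string[]; // dietary codes: L, G, ...
}

// Adapter for a hypothetical site that returns clean JSON with separate
// language fields: extraction is a plain field mapping.
function fromJsonSite(raw: { name_en: string; name_fi: string; diets: string[] }[]): Dish[] {
  return raw.map(r => ({ nameEn: r.name_en, nameFi: r.name_fi, codes: r.diets }));
}

// Adapter for a hypothetical site that only has plain-text lines where the
// dietary codes trail the Finnish name and no English name exists.
function fromTextSite(lines: string[]): Dish[] {
  const CODES = new Set(["L", "VL", "VS", "G", "M"]);
  return lines.map(line => {
    const tokens = line.trim().split(/\s+/);
    const codes: string[] = [];
    while (tokens.length > 0 && CODES.has(tokens[tokens.length - 1])) {
      codes.unshift(tokens.pop()!);
    }
    return { nameEn: "", nameFi: tokens.join(" "), codes };
  });
}

// Steps 3-4: once every source yields Dish[], merging and serving is generic.
const menu: Dish[] = [
  ...fromJsonSite([{ name_en: "Pea soup", name_fi: "Hernekeitto", diets: ["L", "G"] }]),
  ...fromTextSite(["Lihapullat L G"]),
];
```

The point of the design is that the messy, site-specific logic is fenced off inside the adapters, so the rest of the app never cares which site a dish came from.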

