Get information from a web page (title, images, chapters, etc.)

On Facebook, when you add a link to your wall, it fetches the page's title, images and part of its text. I have seen this behaviour on other sites that let you add links. How does it work? Does the technique have a name? Is there a JavaScript / jQuery plugin that implements it?

And how is it possible that Facebook fetches HTML from another site, when cross-site AJAX calls are supposedly forbidden?

Thanks.

+4
5 answers

You can use a server-side PHP script to retrieve the contents of any web page (look up web scraping). What Facebook does is call a server-side PHP script via AJAX, and that script uses the PHP function

file_get_contents('http://somesite.com.au'); 

Now, as soon as the file or web page has been pulled into the server-side script, you can filter the content for particular items. For example, Facebook's link fetcher looks for the title, img tags and meta description parts of the file or web page using regular expressions,

e.g. PHP's

 preg_match(); 

function.

This can be collected and then returned to your web page.

You might also consider adding extra filtering to return only the data you want, as scraping some pages can take longer than expected. E.g. strip out unnecessary things like JavaScript, CSS, irrelevant tags, huge images and so on to make it run faster.

Once you get the hang of this, you could potentially be on the road to building a web search engine, or better yet, to scraping data from sites like Yellow Pages (phone numbers, mailing addresses, etc.).

You can also look further:

 get_meta_tags('http://somesite.com.au'); 
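Putting the pieces above together, here is a minimal sketch of the idea. The function name and regular expressions are illustrative, not Facebook's actual code:

```php
<?php
// Hypothetical sketch: fetch a page server-side and pull out the
// <title>, the meta description and any images with regular expressions.
function fetch_link_preview($url)
{
    $html = file_get_contents($url); // pull the raw page onto the server

    $preview = array('title' => null, 'description' => null, 'images' => array());

    // Text between the <title> tags
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        $preview['title'] = trim($m[1]);
    }
    // <meta name="description" content="...">
    if (preg_match('/<meta[^>]+name=["\']description["\'][^>]+content=["\']([^"\']*)["\']/i', $html, $m)) {
        $preview['description'] = $m[1];
    }
    // All <img src="..."> values
    if (preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $m)) {
        $preview['images'] = $m[1];
    }
    return $preview;
}
```

The result can then be returned to the browser (e.g. as JSON) from the AJAX call.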

:-)

+4

Basic methodology

When the fetch is triggered (for example, by pasting a link into Facebook), you can use AJAX to request the URL*, and then parse the returned data however you wish.

Parsing the data is the difficult bit, because so many websites follow different conventions. Taking the text between the title tags is a good start, as is possibly looking for META descriptions (though they are used less and less as search engines evolve towards more complex content-based analysis).

Failing that, you need to find the most important text on the page and take the first 100 characters or so, as well as finding the most prominent image on the page.
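A crude sketch of the "first 100 characters" heuristic, assuming you already have the page's HTML (function name is illustrative):

```php
<?php
// Strip markup, collapse whitespace, and take the leading snippet of
// visible text as a rough description of the page.
function text_snippet($html, $length = 100)
{
    $text = strip_tags($html);                 // drop all tags
    $text = preg_replace('/\s+/', ' ', $text); // collapse runs of whitespace
    $text = trim($text);
    return substr($text, 0, $length);
}
```

This treats all text equally, which is exactly the weakness discussed below: it cannot tell the page's main content from navigation or boilerplate.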

This is no trivial task; it is extremely difficult to extract semantics from such a fluid and inconsistent dataset (an arbitrary returned web page). For example, you could find the largest image on the page, and that is a good start, but how do you know it is not a background image? How do you know which image best describes the page?

Good luck

* If you cannot request a third-party URL directly with AJAX, you can get around it by requesting a page on your own server that retrieves the remote page with some kind of server-side HTTP request.
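Such a same-origin proxy can be sketched like this (the function name and endpoint are hypothetical; the browser asks your server for the page, and your server makes the remote request):

```php
<?php
// Same-origin proxy sketch: the client requests e.g. proxy.php?url=...,
// and this code fetches the remote page so the browser never makes a
// cross-site AJAX call itself.
function proxy_fetch($url)
{
    // Only allow http/https targets so the proxy cannot be abused to
    // read local files or internal resources.
    if (!preg_match('#^https?://#i', $url)) {
        return false;
    }
    return file_get_contents($url);
}

// Usage in a proxy.php endpoint (illustrative):
// echo proxy_fetch($_GET['url']);
```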

Some additional thoughts

If you grab an image from a remote server and "hotlink" it on your site, some sites serve an "anti-hotlinking" placeholder image when you try to display it. It is therefore worth comparing the image your server requested with what was actually returned, so you do not accidentally display something unexpected.

Many title tags in the head will be generic and non-descriptive; it would be better to get the title of the article (assuming the site is article-based), if one is available, because it will be more descriptive. Finding it is hard, though!

If you are really clever, you might be able to piggyback off Google (note: check their T&Cs first). When a user requests a specific URL, you could run a Google search behind the scenes and use the text Google returns as your preview text. If Google significantly changes their markup, though, this could break very quickly!

+6

There are several APIs that provide this functionality. For example, PageMunch lets you pass in a URL and a callback, so you can call it from the client side or route it through your own server:

http://www.pagemunch.com

An example response for the BBC website is as follows:

 {
   "inLanguage": "en",
   "schema": "http://schema.org/WebPage",
   "type": "WebPage",
   "url": "http://www.bbc.co.uk/",
   "name": "BBC - Homepage",
   "description": "Breaking news, sport, TV, radio and a whole lot more. The BBC informs, educates and entertains - wherever you are, whatever your age.",
   "image": "http://static.bbci.co.uk/wwhomepage-3.5/1.0.64/img/iphone.png",
   "keywords": [
     "BBC",
     "bbc.co.uk",
     "bbc.com",
     "Search",
     "British Broadcasting Corporation",
     "BBC iPlayer",
     "BBCi"
   ],
   "dateAccessed": "2013-02-11T23:25:40+00:00"
 }
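On the server side, a response of that shape can be consumed with PHP's json_decode. The $json string below is a trimmed stand-in for the full payload shown above:

```php
<?php
// Decode a metadata response like the one above and pull out the
// fields you need. json_decode with true returns an associative array.
$json = '{"type":"WebPage","url":"http://www.bbc.co.uk/","name":"BBC - Homepage"}';

$page  = json_decode($json, true);
$title = $page['name'];
$url   = $page['url'];
```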
+3

You can always just look at what is in the title tag. If you need it in JavaScript, it should not be too complicated. Once you have the page's HTML as data you can do:

 var title = $(data).find('title').html(); 

The problem will be getting the data, because as far as I know most browsers will block you from making cross-site AJAX requests. You can get around this by providing a service on your own site that acts as a proxy and makes the request for you. At that point, though, you might as well parse the title out on the server. Since you did not say what your back-end language is, I will not guess.

+1

This is not possible in pure JavaScript because of the cross-domain policy: a client-side script cannot read the contents of pages on other domains, unless the other domain explicitly exposes a JSONP service or similar.

The trick is to make the request on the server side (each server-side language has its own tools for this), parse the result using regular expressions or other string-parsing methods, and then use that server-side code as a "proxy" for the AJAX call made on the fly when a link is posted.

+1

Source: https://habr.com/ru/post/1336781/

