Return URLs when loading multiple URLs using YQL

I use YQL to fetch multiple pages, some of which may be offline (I don't know in advance which ones). I am using this query:

SELECT * FROM html WHERE url IN ("http://www.whooma.net", "http://www.dfdsfsdgsfagdffgd.com", "http://www.cnn.com") 

Here the first and last URLs are real sites and the second obviously does not exist. As expected, two results are returned, but the URL each one was downloaded from is not shown anywhere. So how can I tell which HTML page belongs to which URL when not every page in the request loads?

2 answers

Unfortunately, I don't know of a way to get key => value pairs in the response, where the key is the URL and the value is the HTML response. But you can try the following query and see if it fits your use case:

 select * from yql.query.multi where queries="select * from html where url='http://www.whooma.net';select * from feed where url='http://www.dfdsfsdgsfagdffgd.com';select * from html where url='http://www.cnn.com'" 

Before running the query, keep the URLs in an array in the same order as they appear in queries, for example ['http://www.whooma.net','http://www.dfdsfsdgsfagdffgd.com','http://www.cnn.com']. Call this array A. When you iterate over the response from the YQL query, a URL that does not exist will return null. Sample response from the above query:

 <results>
   <results>
     // Response from select * from html where url='http://www.whooma.net'. This should be some html
   </results>
   <results>
     // Response from select * from feed where url='http://www.dfdsfsdgsfagdffgd.com'. This should be null.
   </results>
   <results>
     // select * from html where url='http://www.cnn.com'. This should also be some html
   </results>
 </results>

So, in conclusion, you can iterate over array A and the YQL response together. The first element of array A corresponds to the first inner results element of the response, the second to the second, and so on, so you can build a hash map from the two arrays. I know the answer is long, but I think it was necessary. Let me know if anything is unclear.
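
The pairing described above could be sketched roughly as follows in Python. This is only an illustration: build_multi_query and pair_urls_with_results are hypothetical helper names, the sketch assumes the inner results have already been parsed into a list with one entry per sub-query (None where nothing loaded), and it does no escaping of quotes inside the URLs.

```python
from typing import Any, Dict, List, Optional


def build_multi_query(urls: List[str], table: str = "html") -> str:
    """Build a yql.query.multi statement with one sub-query per URL, in order."""
    sub_queries = ";".join(
        f"select * from {table} where url='{url}'" for url in urls
    )
    return f'select * from yql.query.multi where queries="{sub_queries}"'


def pair_urls_with_results(
    urls: List[str], results: List[Optional[Any]]
) -> Dict[str, Optional[Any]]:
    """Map each URL to the result at the same position (None if it did not load)."""
    if len(urls) != len(results):
        raise ValueError("expected exactly one result per URL")
    # Empty results are treated the same as missing ones.
    return {url: (result or None) for url, result in zip(urls, results)}


# Array A from the answer: the URLs in the order they appear in the query.
urls = [
    "http://www.whooma.net",
    "http://www.dfdsfsdgsfagdffgd.com",
    "http://www.cnn.com",
]
query = build_multi_query(urls)

# Stand-in for the parsed inner <results> entries (assumed shape, not real data).
inner_results = [{"body": "..."}, None, {"body": "..."}]

pages = pair_urls_with_results(urls, inner_results)
offline = [url for url, page in pages.items() if page is None]
```

Building the query and the lookup from the same list is what keeps the positions aligned; if the two were written by hand separately, a reordering in one would silently mislabel the pages.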


You can determine which URLs failed to load using the YQL diagnostics flag. With diagnostics enabled, the response includes a diagnostics property with a url array that indicates whether the matching servers were found. Presumably, once you eliminate the URLs that didn't load, the returned pages will line up with the remaining URLs in order.
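
A rough Python sketch of that filtering step is below. The field names are assumptions made for illustration, not a confirmed description of the YQL response: it supposes each entry of the diagnostics url array carries the requested URL under "content" and that a failed fetch is marked with an "error" key. failed_urls and loaded_urls are hypothetical helpers, and the sample data is made up.

```python
from typing import Any, Dict, List


def failed_urls(diagnostics: Dict[str, Any]) -> List[str]:
    """Return the URLs whose diagnostics entry is flagged as failed."""
    entries = diagnostics.get("url", [])
    # A lone entry may arrive as a single object instead of a list.
    if isinstance(entries, dict):
        entries = [entries]
    # Assumption: "error" marks a failed fetch, "content" holds the URL.
    return [entry.get("content", "") for entry in entries if "error" in entry]


def loaded_urls(urls: List[str], diagnostics: Dict[str, Any]) -> List[str]:
    """Keep the requested URLs, in their original order, that did not fail."""
    failed = set(failed_urls(diagnostics))
    return [url for url in urls if url not in failed]


urls = [
    "http://www.whooma.net",
    "http://www.dfdsfsdgsfagdffgd.com",
    "http://www.cnn.com",
]

# Made-up diagnostics object in the assumed shape.
diagnostics = {
    "url": [
        {"content": "http://www.whooma.net", "execution-time": "120"},
        {"content": "http://www.dfdsfsdgsfagdffgd.com", "error": "hypothetical error"},
        {"content": "http://www.cnn.com", "execution-time": "95"},
    ]
}

remaining = loaded_urls(urls, diagnostics)
```

The list in remaining can then be matched by position against the pages that were actually returned. Filtering the original request list, rather than reading URLs straight out of the diagnostics, preserves the request order and ignores any extra entries the service may log.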


Source: https://habr.com/ru/post/955139/

