Retrieve Google Search Results

I would like to periodically check which of our subdomains Google has indexed.

To get a list of subdomains, I type "site:example.com" into the Google search box, and it lists results from all of the subdomains (more than 20 pages of results for our domain).

What is the best way to retrieve only the URLs returned by a search for site:example.com?

I was thinking of writing a small Python script that performs the search above and pulls the URLs out of the search results with a regular expression, repeating over all the result pages. Is this a good starting point? Is there a better approach?
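
Roughly, I had something like this in mind (a naive sketch; the regex and the assumption about what Google's result HTML contains are guesses, not tested):

import re
import urllib2

### Google tends to reject the default urllib2 user agent,
### so pretend to be a browser
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
html = opener.open("http://www.google.com/search?q=site:example.com").read()

### Naive: grab anything that looks like a link into the domain
for url in re.findall(r'https?://[\w.-]*example\.com[^"<&]*', html):
    print url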

Thanks.

Regex is the wrong tool for parsing HTML; use an HTML parser.

BeautifulSoup is an HTML parser for Python. Here is a Python 2 script that prints the URLs from the first 10 pages of Google results for a site:domain.com query (using stackoverflow.com as the example):

import sys # Used to add the BeautifulSoup folder to the import path
import urllib2 # Used to read the html document

if __name__ == "__main__":
    ### Import Beautiful Soup
    ### Here, the BeautifulSoup folder sits at the same level as this Python script,
    ### so I need to tell Python where to look.
    sys.path.append("./BeautifulSoup")
    from BeautifulSoup import BeautifulSoup

    ### Create opener with Google-friendly user agent
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]

    ### Open page & generate soup
    ### the "start" variable will be used to iterate through 10 pages.
    for start in range(0,10):
        url = "http://www.google.com/search?q=site:stackoverflow.com&start=" + str(start*10)
        page = opener.open(url)
        soup = BeautifulSoup(page)

        ### Parse and find
        ### Google appears to put the result URLs in <cite> tags.
        ### So for each <cite> tag on each of the 10 pages, print its contents (the URL).
        for cite in soup.findAll('cite'):
            print cite.text

The output looks like this:

stackoverflow.com/
stackoverflow.com/questions
stackoverflow.com/unanswered
stackoverflow.com/users
meta.stackoverflow.com/
blog.stackoverflow.com/
chat.meta.stackoverflow.com/
...
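
Note that those are individual result URLs, not unique subdomains. A small post-processing step reduces them to unique hosts; a minimal sketch (the hard-coded list just stands in for the scraped results):

from urlparse import urlparse  # Python 2; in Python 3 this lives in urllib.parse

### Stand-in for the URLs scraped above
results = [
    "stackoverflow.com/",
    "stackoverflow.com/questions",
    "meta.stackoverflow.com/",
    "blog.stackoverflow.com/",
]

subdomains = set()
for url in results:
    ### The <cite> text carries no scheme, so prepend one before parsing
    subdomains.add(urlparse("http://" + url).netloc)

for host in sorted(subdomains):
    print host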

Hopefully this gives you a good starting point; Python is well suited to this kind of task.
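
For what it's worth, on Python 3 the same approach looks like this with the third-party requests and beautifulsoup4 packages (again just a sketch: Google's markup changes over time, so the <cite> assumption may need revisiting, and Google may block repeated automated queries):

import requests                # pip install requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

headers = {"User-Agent": "Mozilla/5.0"}

for start in range(0, 10):
    url = "http://www.google.com/search?q=site:stackoverflow.com&start=%d" % (start * 10)
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, "html.parser")

    ### Same assumption as above: result URLs live in <cite> tags
    for cite in soup.find_all("cite"):
        print(cite.get_text())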
